The AI Control Problem

When AI Builds and Breaks: The Widening Code Governance Gap

Thiago Victorino

In December 2025, Amazon’s internal AI coding tool Kiro autonomously decided to delete and recreate an environment. The result was a 13-hour outage of AWS Cost Explorer. Amazon classified it as a “user access control issue”: the engineer who deployed Kiro had broader permissions than expected, so a tool designed to request authorization before taking action was able to proceed without that safeguard.

This was, according to multiple Amazon employees who spoke to the Financial Times, at least the second recent outage involving the company’s AI tools. A senior AWS employee called the incidents “small but entirely foreseeable.”

That last word matters. Foreseeable.

The Infrastructure That Breaks Itself

The Kiro incident is notable not for its blast radius, a single service in one geographic region, but for its structural irony. The company spending $200 billion on AI infrastructure this year had its own AI tool knock out part of that infrastructure. The tool designed to accelerate development instead caused a 13-hour production outage.

Amazon’s response was reasonable: mandatory peer review for production access. But the incident reveals a pattern that extends far beyond one company. Organizations are deploying AI code generation tools faster than they are building the verification infrastructure to contain them.

The Guardian’s investigation in March 2026 provided the ground-level view. More than half a dozen current and former Amazon corporate employees described a workplace where AI tools are pushed on workers regardless of fit. A software developer identified as “Dina” described Kiro as frequently hallucinating and generating flawed code. Her job, she said, had shifted from writing code to “fixing what artificial intelligence breaks.” She called the experience “trying to AI my way out of a problem that AI caused.”

A supply chain engineer reported that AI tools were helpful in roughly one of every three attempts. Even in successful cases, the verification and correction process consumed more time than doing the task without AI. A developer named “Denny” found that a colleague’s AI-generated code, which had supposedly saved a week of developer effort, was full of basic issues flagged by reviewers.

“This pressure to use [AI] has resulted in worse quality code, but also just more work for everyone,” Denny told the Guardian.
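The supply chain engineer’s observation can be turned into a back-of-envelope expected-time model. Every number below is an illustrative assumption, not a measured value; the point is only that a low hit rate combined with a fixed verification cost can make the AI-first path slower than doing the work by hand:

```python
# Back-of-envelope model of the "one in three attempts" observation.
# All figures are illustrative assumptions, not measured values.
p_helpful = 1 / 3        # fraction of attempts where the AI output is usable
t_manual = 60            # minutes to do the task entirely by hand
t_prompt = 5             # minutes spent prompting and waiting for output
t_verify = 30            # minutes spent reviewing the output either way
t_fixup = 20             # remaining work when the output is usable

# Expected minutes per task when the AI is always tried first:
# prompting and verification are paid on every attempt, and a failed
# attempt falls back to doing the task manually.
expected_with_ai = t_prompt + t_verify + (
    p_helpful * t_fixup + (1 - p_helpful) * t_manual
)
print(f"manual: {t_manual} min, AI-first: {expected_with_ai:.1f} min")
```

With these assumptions the AI-first path costs roughly 82 minutes against 60 by hand, which matches the engineer’s report that verification and correction consumed more time than the AI saved.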

The Productivity Measurement Problem

These accounts align with a structural argument that the software industry has been slow to confront: code generation is not productivity.

Dijkstra said it decades ago: “Lines should not be regarded as ‘lines produced’ but as ‘lines spent.’” Bill Gates compared measuring programming progress by lines of code to measuring aircraft construction by weight. Linus Torvalds called it “a completely bogus metric for anything.”

The LLM era has resurrected this dead metric. When a model generates 500 lines in seconds, it feels productive. But Microsoft Research has established that developers spend most of their time on activities other than coding. The code itself is a liability — it must be read, reviewed, maintained, debugged, and eventually replaced. More code is not more value. It is more surface area for failure.

An experienced developer who documented a year of heavy LLM usage described a revealing pattern. Despite high code output, most of the effort went into editing generated code: fighting the model’s tendency to duplicate code instead of reusing it, to write tests that do not actually test anything, to miss abstraction opportunities, and to reach for heavyweight frameworks where simple functions suffice. The developer’s summary was blunt: “All of my fighting with LLMs is to get them to write less code.”

This is the verification tax made visible at the individual developer level. The time AI saves in generation, it costs in review. The net effect on actual productivity — features delivered, bugs avoided, systems maintained — remains unmeasured by most organizations that have already restructured around the assumption that AI delivers transformational gains.

The Stagnation Nobody Discusses

While organizations accelerate AI code generation deployment, the tools themselves have stopped improving.

An analysis of METR’s merge rate data — the closest available proxy for real-world coding capability — shows that LLM performance on practical programming tasks has been flat for over 12 months. Using leave-one-out cross-validation, a constant-performance model (Brier score: 0.0100) outperforms both a gentle upward slope (0.0129) and a piecewise constant model (0.0117). The statistical evidence favors “no improvement” over “gradual improvement.”
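The comparison can be sketched in a few lines. The code below runs the same kind of leave-one-out evaluation on synthetic data standing in for the real METR series (the actual rates, time window, and the 0.0100/0.0129/0.0117 scores are not reproduced here); with rates in [0, 1], each held-out Brier score reduces to a squared error:

```python
# Leave-one-out comparison of three models of a merge-rate time series.
# The data below is synthetic, generated flat-with-noise to illustrate
# the "no improvement" hypothesis; it is not the real METR series.
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(12.0)
rates = 0.30 + rng.normal(0.0, 0.03, size=12)

def loo_score(fit_predict):
    """Mean squared error of leave-one-out predictions (Brier-style)."""
    errs = []
    for i in range(len(rates)):
        mask = np.arange(len(rates)) != i
        pred = fit_predict(months[mask], rates[mask], months[i])
        errs.append((pred - rates[i]) ** 2)
    return float(np.mean(errs))

def constant_model(x, y, x_new):
    return y.mean()                          # flat line: predict the average

def linear_model(x, y, x_new):
    slope, intercept = np.polyfit(x, y, 1)   # gentle fitted slope
    return slope * x_new + intercept

def piecewise_model(x, y, x_new):
    cut = x.mean()                           # two flat segments, split mid-range
    seg = y[x <= cut] if x_new <= cut else y[x > cut]
    return seg.mean()

for name, model in [("constant", constant_model),
                    ("linear", linear_model),
                    ("piecewise", piecewise_model)]:
    print(f"{name:9s} LOO score: {loo_score(model):.4f}")
```

On genuinely flat data, the constant model tends to win this comparison because the slope and piecewise variants spend parameters fitting noise, which is exactly the pattern the METR analysis reports.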

This matters because the gap between what benchmarks measure and what developers experience is already large. As we documented when the verification stack started consolidating, METR found a 24-percentage-point gap between automated benchmark scores and real maintainer merge decisions. Automated graders say the code passes. Human maintainers say it does not.

Now add temporal stagnation. Benchmarks do not merely overstate the tools’ real-world capability; the tools are also not getting better at the tasks the benchmarks claim to measure. The 50% success horizon for merge-worthy code sits at 8 minutes of task complexity; for test-passing code, it is 50 minutes. That gap has not narrowed.

Organizations making procurement decisions based on benchmark improvements are buying a trend that does not exist in the data. As we explored when OpenAI retired SWE-bench Verified, 59.4% of audited problems had materially defective test cases, and all frontier models showed evidence of training data contamination. The measurement infrastructure is not just stagnant — it is structurally compromised.

The Governance Gap

The convergence of these three threads — AI tools breaking their own infrastructure, code generation failing to translate into productivity, and benchmark stagnation eliminating the improvement narrative — describes a widening governance gap.

On one side: deployment velocity. Amazon tracks AI usage on dashboards. Managers target 80% weekly adoption. Promotion documents now ask how candidates leveraged AI. The company has laid off 30,000 corporate workers in four months while investing $200 billion in AI infrastructure. The organizational incentive is maximum AI adoption, maximum speed.

On the other side: verification capacity. The tools hallucinate. The generated code requires extensive review. The benchmarks that justify procurement are contaminated and stagnant. And the one production incident we know about — Kiro deleting an environment — was “entirely foreseeable” by the engineers who work with these tools daily.

The gap between these two sides is the code governance gap. And it is widening because the incentives all push in one direction.

Amazon’s spokesperson said the company does not mandate AI tool usage and that these tools have saved “hundreds of millions of dollars and thousands of years of work.” That claim is unfalsifiable without independent verification infrastructure — which, as we have documented, is consolidating under the control of the labs being evaluated.

What This Means for Engineering Leaders

The Kiro incident is not a reason to abandon AI coding tools. It is a reason to govern them.

Separate generation from deployment. The Kiro outage happened because an AI tool had production access with insufficient permission boundaries. AI-generated code should pass through the same review gates as human-generated code. Amazon implemented mandatory peer review after the incident. That gate should have existed before it.
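One way to make “generation is not deployment” concrete is a gate that refuses production changes lacking an independent approval. The sketch below is hypothetical: the change shape and field names are invented for illustration and do not correspond to any real CI system or to Amazon’s implementation.

```python
# Hypothetical sketch of a deployment gate: production changes require
# at least one approval from someone other than the author, whether the
# author is a human or an autonomous agent. Field names are invented.
def may_deploy(change: dict) -> bool:
    if change.get("target") != "production":
        return True  # non-production targets are ungated in this sketch
    approvals = {a for a in change.get("approvals", []) if a != change["author"]}
    return len(approvals) >= 1

# An agent with broad permissions still cannot self-approve:
blocked = may_deploy({"author": "ai-agent", "target": "production",
                      "approvals": []})
allowed = may_deploy({"author": "ai-agent", "target": "production",
                      "approvals": ["reviewer-1"]})
print(blocked, allowed)  # False True
```

The design point is that the gate keys on who approved, not who authored; broad runtime permissions are irrelevant if the deploy path itself demands an independent reviewer.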

Measure verification cost, not adoption rate. Amazon tracks AI usage metrics. It should track AI verification metrics: hours spent correcting AI output, defect rates in AI-generated code versus human-generated code, review cycle time for AI-assisted PRs. As we argued in The Verification Tax, until organizations track the cost of verifying AI output, every productivity claim is an estimate from someone who does not pay the tax.
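A sketch of what tracking the verification side might look like. The schema below (author_type, correction_hours, review_hours, defects) is invented for illustration, not any real tracking system; the point is which quantities to aggregate per pull request alongside the adoption dashboards:

```python
# Illustrative verification-cost metrics per pull request.
# The schema and field names are hypothetical.
from dataclasses import dataclass
from statistics import mean

@dataclass
class PullRequest:
    author_type: str         # "ai-assisted" or "human"
    correction_hours: float  # time spent fixing the change after review
    review_hours: float      # reviewer time before merge
    defects: int             # defects traced back to the change

def verification_report(prs):
    report = {}
    for kind in ("ai-assisted", "human"):
        subset = [p for p in prs if p.author_type == kind]
        if not subset:
            continue
        report[kind] = {
            "avg_correction_hours": mean(p.correction_hours for p in subset),
            "avg_review_hours": mean(p.review_hours for p in subset),
            "defects_per_pr": mean(p.defects for p in subset),
        }
    return report

prs = [
    PullRequest("ai-assisted", 3.0, 2.0, 2),
    PullRequest("ai-assisted", 1.5, 1.0, 1),
    PullRequest("human", 0.5, 1.0, 0),
]
print(verification_report(prs))
```

Comparing the two rows of such a report, rather than an adoption percentage, is what turns the productivity claim into something falsifiable.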

Stop citing benchmark improvements as justification. The data shows 12+ months of stagnation in real-world coding capability, a 24-point gap between benchmark scores and maintainer decisions, and pervasive training data contamination across frontier models. If your AI coding tool procurement decision rests on SWE-bench scores, it rests on nothing.

Build organizational verification infrastructure. This is the same recommendation we have made since documenting AI verification debt. Internal evaluation suites, domain-specific code review processes, operational outcome measurement. The tools that generate code are improving slowly, if at all. The tools that verify code remain mostly unbuilt.

The code governance gap will not close by itself. Every organization deploying AI coding tools at scale is making an implicit bet that verification can wait. Amazon’s Kiro incident shows what happens when it cannot.


This analysis synthesizes Engadget’s reporting on the AWS Kiro outage (February 2026), The Guardian’s investigation into Amazon’s AI workplace practices (March 2026), Antifound’s analysis of codegen productivity (March 2026), and Entropic Thoughts’ statistical analysis of SWE-bench stagnation (March 2026).

Victorino Group helps enterprises close the code governance gap before it closes them. Let’s talk.
