The AI Control Problem

The Evidence Is In: AI Coding Agents Are Breaking Things

Thiago Victorino

For months, the AI coding debate has been theoretical. Optimists pointed to developer surveys and token counts. Skeptics cited vibes and intuition. Both sides were arguing from projections.

That phase is over. We now have incident reports.

Anthropic Cannot Debug Anthropic

Start with the most uncomfortable example. Claude.ai, the flagship product of the company building arguably the most capable coding agent on earth, had a persistent UX bug throughout early 2026. Typed prompts would vanish during page load. It affected every paying customer, every visit.

Nobody at Anthropic caught it.

Gergely Orosz, author of The Pragmatic Engineer, flagged it publicly on X in March 2026. Only then was it fixed. This is the same company where, by internal accounts, over 80% of code is generated by Claude Code.

The bug was not complex. It was not hidden in some obscure edge case. It was the primary interaction surface of the product, broken, in front of millions of users. The humans reviewing Claude’s output missed what any frustrated customer noticed immediately.

This is not an argument against AI code generation. It is a demonstration of what happens when verification infrastructure does not scale with generation capacity. As we explored in Cheap Code, Expensive Quality, the cost of producing code collapsed. The cost of knowing whether it works did not.

Amazon’s Thirteen-Hour Lesson

In December 2025, Amazon’s Kiro AI agent was tasked with modifying the AWS Cost Explorer environment. The agent deleted the environment and recreated it from scratch. The outage lasted thirteen hours.

By March 2026, Amazon’s retail division had experienced four major incidents, including a six-hour shopping outage. SVP Dave Treadwell wrote an internal memo that should be required reading for every engineering leader deploying AI agents: “GenAI tools supplement or accelerate production change instructions, leading to unsafe practices.”

Amazon’s response was telling. Junior and mid-level engineers now need senior sign-off for AI-assisted production changes. Fifteen hundred engineers protested a mandate requiring 80% weekly Kiro usage. Internal sources told The Pragmatic Engineer that AI tools produced “worse quality code, but also just more work for everyone.”

Read that last quote again. Not just worse code. More work. The tool sold as a force multiplier became a force divider.

The critical nuance here: these are governance failures, not tool failures. Kiro had operator-level permissions with no guardrails. No human-in-the-loop for destructive operations. No staged rollout. The agent was trusted like a senior engineer and supervised like nobody at all.

The Measurement Trap Materializes

We have been writing about measurement distortion for weeks. The evidence has now moved from surveys to corporate policy.

Meta made AI tool usage a formal part of performance reviews in 2026. Token consumption is tracked in calibrations. If you want to get promoted at Meta, you need to demonstrate that you are using AI. Whether that usage produces better outcomes is a separate question that the review system does not ask.

Uber’s CEO Dara Khosrowshahi told Bloomberg that power users (those on AI tools 20 or more days per month) demonstrate “productivity I’ve never seen before.” The metric behind that claim: pull request count. Not defect rates. Not time-to-resolution. Not customer impact. Pull requests.

As we analyzed in The AI Intensity Trap, measuring usage intensity creates perverse incentives. When the metric is activity volume, you get more activity. Whether that activity creates value is a question the metric cannot answer. And as we documented in McKinsey Measured the Wrong Thing, this perception mismatch is not noise. It is structural.

Khosrowshahi also hinted at replacing headcount with “agents and GPUs.” When the CEO frames AI as a headcount replacement and ties AI usage to performance reviews, engineers receive a clear signal: produce visible AI output or risk your job. Quality becomes secondary to demonstrable adoption.

The Perception Mismatch Gets a Replication

The METR study from July 2025 remains the most rigorous controlled trial of AI coding tools. Experienced developers were 19% slower with AI assistance. They believed they were 20% faster. A 39-point perception mismatch.

METR published an update in February 2026 with a newer cohort. The result: negative 4%, not statistically significant. METR itself called the original data “very weak evidence.” But notice what even the improved result shows. After months of additional AI tool development, the best controlled measurement available still cannot demonstrate a statistically significant speedup for experienced developers on mature codebases.

The feeling of speed is real. Dax Raad, creator of the OpenCode agent framework, put it precisely: “The productivity feeling is real. The productivity isn’t.” Raad added that well-organized codebases perform “dramatically better” with LLMs, and that sequential work with faster models outperforms parallel agents. Structure matters more than speed.

The Developer Verdict

Stack Overflow’s 2025 developer survey asked practitioners what frustrates them most about AI coding tools. The top answer: code that “looks correct but is slightly wrong.”

Nearly half of respondents said debugging AI-generated code takes longer than writing it from scratch.

One developer captured the shift in a metaphor that deserves to outlast the survey: the job changed “from craftsman making a perfect chair” to “factory manager of Ikea shipping low-quality chairs.”

That metaphor is precise because it identifies what actually changed. Not the skill of the worker. Not the capability of the tool. The relationship between the person and the output. When you write code, you understand it because the understanding preceded the writing. When you review generated code, understanding must be constructed after the fact, from an artifact you did not design.

The Solow Paradox and the Real Variable

Critics of this narrative have a valid counterpoint. IBM estimates $4.5 billion in AI-driven gains. Zapier runs 800+ internal agents. Success stories exist.

The Solow Paradox is also instructive. Robert Solow observed in 1987 that “you can see the computer age everywhere but in the productivity statistics.” It took a decade of workflow restructuring before computers showed measurable gains. AI coding tools may follow the same trajectory.

But the Solow Paradox has a precondition that gets overlooked: the productivity gains arrived only after organizations restructured around the technology. They did not arrive from bolting computers onto existing workflows and mandating usage. The restructuring was the work.

This points to the real variable. Not “AI versus no AI.” The real variable is governed AI versus ungoverned AI. Amazon’s outages came from ungoverned deployment. Anthropic’s bug came from ungoverned review. Meta’s measurement distortion comes from ungoverned incentives. Every failure in this article traces to the same root: organizations treated AI adoption as a procurement decision when it is an operational discipline.

What Governance Actually Looks Like

One solution is both obvious and underdeployed: deterministic quality gates.

Static code analysis (linters, complexity detectors, security scanners, dead code finders) can constrain AI-generated code to meet quality standards before any human reviews it. These tools are not new. They are not AI-dependent. They are the same instruments that governed human code for decades.

The irony is sharp. Organizations relaxed quality gates precisely when code became cheaper to produce, which is precisely when they needed those gates most. Cyclomatic complexity limits, dependency analysis, architectural fitness functions: these give AI agents hard boundaries. The agent writes the code. The linter tells it whether the code is good enough. No judgment required for the deterministic checks. Human judgment reserved for the questions that actually need it.
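To make the idea concrete, here is a rough sketch of such a gate using only Python's standard library: it approximates cyclomatic complexity by counting branch points per function and fails any function over a hard limit. The limit of 10 and the node list are assumptions for illustration; real pipelines would use a dedicated tool such as radon or lizard, but the principle is identical: a deterministic threshold the generated code must clear before a human ever looks at it.

```python
import ast

# Rough cyclomatic-complexity gate: 1 per function, +1 per branch point.
COMPLEXITY_LIMIT = 10
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def complexity(func: ast.FunctionDef) -> int:
    """Approximate cyclomatic complexity of a single function."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(func))

def gate(source: str) -> list[str]:
    """Return violations for the given source; an empty list means it passes."""
    tree = ast.parse(source)
    return [
        f"{f.name}: complexity {complexity(f)} exceeds {COMPLEXITY_LIMIT}"
        for f in ast.walk(tree)
        if isinstance(f, ast.FunctionDef) and complexity(f) > COMPLEXITY_LIMIT
    ]

print(gate("def ok(x):\n    return x + 1\n"))  # passes: []
```

Wire a check like this into CI, and the agent gets a hard boundary it cannot negotiate with: regenerate until the gate passes, or the code does not merge.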

As we described in The Agent Operations Paradox, more agents create more operational work, not less. Deterministic gates are the only way to scale quality verification without scaling headcount proportionally.

The Reckoning Is Not About AI

Every piece of evidence in this article points to the same conclusion, and it is not “AI coding tools are bad.”

AI coding tools are powerful. They are also ungoverned. The organizations experiencing outages, quality degradation, and developer backlash are not the ones using AI. They are the ones using AI without controls, without verification infrastructure, without quality gates that match the volume of code being produced.

The evidence is no longer theoretical. It is in Amazon’s incident reports, Anthropic’s bug tracker, Meta’s performance review templates, and the Stack Overflow survey responses of hundreds of thousands of developers.

The question was never whether AI would change how we write software. It will. The question is whether organizations will build the governance to make that change productive, or whether they will keep measuring pull request counts while their production environments burn.


This analysis synthesizes Gergely Orosz’s reporting in The Pragmatic Engineer (March 2026), the METR AI developer productivity study (July 2025, updated February 2026), Stack Overflow’s 2025 Developer Survey, Dax Raad’s commentary on AI coding productivity (March 2026), and Bloomberg’s interview with Uber CEO Dara Khosrowshahi (March 2026).

Victorino Group helps engineering organizations build the governance infrastructure that makes AI coding tools productive instead of destructive. Let’s talk.
