The Four Failure Modes of Ungoverned AI Coding

Thiago Victorino

We have covered the institutional evidence: Amazon outages, Anthropic’s own bugs, mandated adoption backlash. We have covered the research: SWE-CI’s 75%+ regression rates, cognitive debt accumulating beneath clean code. What we have not had is a practitioner’s taxonomy. A field-tested classification of how, specifically, AI agents degrade the codebases they touch.

Mario Zechner, a game developer and engineer, published that taxonomy this week. His essay “Thoughts on Slowing the Fuck Down” identifies four failure modes that emerge when AI coding agents operate without governance. The taxonomy is not academic. It comes from watching codebases rot in real time.

Each failure mode maps to a specific governance control. Zechner diagnoses the disease. This article prescribes the treatment.

Mode 1: Error Compounding

Zechner’s observation is blunt: “An agent has no such learning ability. It will continue making the same errors over and over again.”

A human developer who writes a flawed database query learns from the code review. The next query is better. A junior engineer who mishandles error propagation once will handle it correctly next time, because the correction creates a memory. Agents do not form memories across sessions by default. The same mistake in session one becomes the same mistake in session forty.

This is true, but incomplete. Modern agent frameworks do have memory mechanisms. Claude Code supports persistent instructions via CLAUDE.md files, skills that encode learned patterns, and memory features that carry context across sessions. The question is whether teams actually configure these systems. Most do not. The default is amnesia, and defaults determine outcomes at scale.

The governance control: codified standards. Not style guides that sit in a wiki. Machine-readable rules that the agent consumes before every session. If your agent cannot reference your team’s architectural decisions, naming conventions, and known failure patterns before it writes a single line, you are running it without institutional memory. The discipline is not in the tool. It is in the configuration.
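As one possible shape for this control, here is a minimal Python sketch that gathers machine-readable rule files and assembles the preamble an agent would consume at session start. The `docs/agent-rules` directory, the Markdown file layout, and the function name are illustrative assumptions, not a standard:

```python
from pathlib import Path

def build_session_context(rules_dir: str = "docs/agent-rules") -> str:
    """Concatenate machine-readable team rules into a preamble the agent
    consumes before every session. Directory layout is hypothetical;
    adapt it to your repo."""
    sections = []
    for path in sorted(Path(rules_dir).glob("*.md")):
        # One section per rule file: architecture decisions, naming
        # conventions, known failure patterns, and so on.
        sections.append(f"## {path.stem}\n{path.read_text()}")
    if not sections:
        # Failing loudly beats silently running an agent with no
        # institutional memory.
        raise RuntimeError("No rules found in " + rules_dir)
    return "\n\n".join(sections)
```

The point is not this particular script. It is that the rules live in the repository, versioned alongside the code, and reach the agent automatically rather than through someone remembering to paste them.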

Mode 2: Removed Bottlenecks

“You have removed yourself from the loop, so you don’t even know that all the innocent booboos have formed a monster of a codebase.”

This is the most dangerous failure mode because it feels like success. The agent produces code faster than any human team. Pull request counts climb. Sprint velocity looks spectacular. Leadership celebrates.

Meanwhile, small decisions compound. An unnecessary abstraction layer here. A duplicated utility there. A data model that works for today’s requirements but creates a migration nightmare for tomorrow’s. No single decision is catastrophic. But as we explored in The Evidence Is In: AI Coding Agents Are Breaking Things, the pattern is consistent across organizations: the tool sold as a force multiplier becomes a force divider once the accumulated decisions start interacting.

The bottleneck was never a bottleneck. It was a checkpoint. Human review catches the decisions that are locally reasonable but globally destructive. Remove the checkpoint and you get throughput. You also get a codebase that nobody fully understands. As we explored in Cognitive Debt, this is institutional ignorance that passes every automated check.

The governance control: mandatory human review for architectural decisions. Not every line of code needs human eyes. But every decision that changes interfaces, data models, dependency structures, or service boundaries does. The control is a review gate, not a review blanket. Define which categories of change require human approval and enforce that boundary in your CI pipeline.
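A gate of this kind can be approximated in a few lines. The Python sketch below classifies changed file paths and blocks a pipeline when an architectural change lacks explicit human approval. The path prefixes and the approval flag are hypothetical; they would map to your own repo layout and review tooling:

```python
# Path prefixes that signal interface, data model, dependency, or
# service-boundary changes. These are assumptions for illustration.
ARCHITECTURAL_PREFIXES = ("migrations/", "proto/", "api/", "deps/")

def requires_human_review(changed_paths: list[str]) -> bool:
    """True if any changed file falls in a category that needs a human."""
    return any(p.startswith(ARCHITECTURAL_PREFIXES) for p in changed_paths)

def gate(changed_paths: list[str], human_approved: bool) -> None:
    """Fail the pipeline for unapproved architectural changes."""
    if requires_human_review(changed_paths) and not human_approved:
        raise SystemExit("Architectural change requires human approval")
```

Routine changes pass untouched, which is what keeps the gate from becoming a blanket.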

Mode 3: Complexity Accumulation

Agents are local optimizers. Given a task, they find a solution that works for that task. They do not ask whether the solution is consistent with the system’s existing patterns. They do not ask whether a simpler approach exists elsewhere in the codebase that they could reuse. They do not ask whether their solution creates maintenance burden for future changes.

The SWE-CI benchmark confirmed this empirically: 75%+ of agent-generated fixes introduce regressions because agents optimize for the immediate task without modeling downstream effects. Zechner describes the same phenomenon from the practitioner’s side. Each agent session adds complexity that is rational in isolation and irrational in aggregate.

The result is a codebase that grows in every dimension. More files. More abstractions. More indirection. More surface area for bugs. A human team under pressure will sometimes step back and refactor. An agent never steps back. It only moves forward, one local optimum at a time.

The governance control: complexity budgets and architectural review. Set measurable thresholds: maximum file count per module, maximum dependency depth, maximum cyclomatic complexity per function. When the agent’s output exceeds a threshold, it triggers architectural review before merge. The budget forces the conversation that the agent will never initiate on its own.
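One way to make such a budget enforceable is a small static check. The sketch below approximates cyclomatic complexity by counting branch points per function with Python's `ast` module; the limit of 10 is an illustrative assumption, not a recommendation:

```python
import ast

# Node types that add a branch point. A rough proxy for cyclomatic
# complexity, good enough to trigger a review, not a precise metric.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler)

def over_budget(source: str, limit: int = 10) -> list[str]:
    """Return names of functions whose approximate complexity exceeds limit."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES)
                           for n in ast.walk(node))
            if branches + 1 > limit:  # +1 for the single entry path
                offenders.append(node.name)
    return offenders
```

Wired into CI, a non-empty return value routes the change to architectural review instead of merging silently.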

Mode 4: Recall Degradation

“The bigger the codebase, the lower the recall. Low recall means that your agent will not find all the code it needs.”

This is the failure mode that worsens over time. As the codebase grows (accelerated by modes 1 through 3), the agent’s ability to find relevant existing code decreases. It cannot hold the entire codebase in context. So it searches, and search is lossy. It misses the utility function on line 847 of a file it never opened. It misses the configuration constant defined in a module it did not know existed. It rewrites what already exists, because it cannot recall what already exists.

The irony is structural. The more code the agent generates, the harder it becomes for the agent to work with that code. The tool degrades its own operating environment.

The governance control: context management infrastructure. Curated documentation that the agent can reference. Code organization that makes discovery easier. Indexing systems that surface relevant code before the agent starts writing. A codebase kept small enough, and organized enough, that recall stays high. Some teams solve this with well-maintained README files and module-level documentation. Others use retrieval-augmented approaches. The mechanism matters less than the discipline: someone must own the agent’s context, or the context will decay alongside the code.
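A minimal version of the indexing idea can be surprisingly small. The Python sketch below builds a symbol index mapping top-level function and class names to the files that define them, so a pre-flight step can tell the agent what already exists before it writes a duplicate. A production setup would persist and refresh the index rather than rebuild it each run:

```python
import ast
from pathlib import Path

def build_index(root: str) -> dict[str, str]:
    """Map each top-level function/class name to its defining file."""
    index: dict[str, str] = {}
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in tree.body:  # top-level definitions only
            if isinstance(node, (ast.FunctionDef,
                                 ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                index[node.name] = str(path)
    return index
```

Surfacing this index in the agent's context is one cheap way to keep recall from decaying as the codebase grows.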

The Taxonomy Is Incomplete, and That Is Fine

Zechner’s four modes are not exhaustive. They do not account for agents that operate with memory systems, context management, and human-in-the-loop configurations. They do not address the counterexamples: Stripe processes 1,300 agent-generated pull requests per week with low defect rates. Shopify has scaled agent usage across its engineering organization without the catastrophic decay Zechner describes.

The difference between the success stories and the failure stories is not the tool. It is the governance. Stripe did not hand agents the keys and walk away. They built review infrastructure, defined boundaries, and maintained human oversight at decision points that matter.

Zechner’s taxonomy is valuable precisely because it names what goes wrong in the absence of those controls. It is a failure taxonomy, not a tool taxonomy. The modes describe ungoverned behavior. Govern the behavior and the modes become manageable.

What This Means for Engineering Leaders

The four failure modes are not independent. They reinforce each other. Error compounding produces bad code. Removed bottlenecks let bad code through. Complexity accumulates from bad code that got through. Recall degrades as complexity grows. The cycle accelerates.

Breaking the cycle requires governance at each stage. Not governance as bureaucracy. Governance as engineering discipline: codified standards, review gates, complexity budgets, context management. Four controls for four failure modes.

The organizations that treat AI coding agents as autonomous developers will experience all four modes. The organizations that treat them as powerful tools requiring operational discipline will not. The taxonomy makes the choice concrete.


This analysis synthesizes Thoughts on Slowing the Fuck Down (March 2026) and supporting evidence from Financial Times and TechCrunch reporting on AI code adoption at Amazon and Microsoft.

Victorino Group helps enterprises govern AI development at scale. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com . About The Thinking Wire →
