The Maintenance Crisis: Why AI Agents That Fix Bugs Create More Bugs
SWE-bench measures whether an agent can fix a bug. It does not measure whether that fix survives the next commit. Or the commit after that. Or the sixty-nine commits that follow over 233 days of real software evolution.
A new benchmark called SWE-CI does measure this. The results should concern anyone treating AI agent output as production-ready code.
The Longitudinal Evidence
Jialong Chen and colleagues built SWE-CI (arXiv 2603.03823v1, March 2026) to answer a question no existing benchmark asks: what happens to an agent’s fix over time?
The setup: 100 tasks drawn from 68 Python repositories. Each task spans an average of 233 days and 71 consecutive commits. Eighteen models from eight providers. Over 10 billion tokens consumed in evaluation. This is not a snapshot. It is a time-lapse.
The metric that matters is the zero-regression rate: what fraction of an agent's fixes introduce no regressions in the tests that were already passing? Most models scored below 0.25, meaning 75% or more of their fixes broke something else.
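The metric itself is simple to state. Here is a sketch of how a zero-regression rate could be computed; the data shapes are invented for illustration and do not reflect SWE-CI's actual harness:

```python
def zero_regression_rate(tasks):
    """Fraction of fixes that break none of the previously passing tests.

    Each task is a (passing_before, passing_after) pair of test-ID sets.
    The pair shape is illustrative; SWE-CI's real data format differs.
    """
    clean = 0
    for passing_before, passing_after in tasks:
        # A regression: a test that passed before the patch but fails after.
        if not (set(passing_before) - set(passing_after)):
            clean += 1
    return clean / len(tasks)

# One clean fix, three that each break a previously passing test.
tasks = [
    ({"t1", "t2"}, {"t1", "t2", "t3"}),  # adds t3, breaks nothing
    ({"t1", "t2"}, {"t1"}),              # t2 regressed
    ({"t1", "t2"}, {"t2"}),              # t1 regressed
    ({"t1", "t3"}, {"t1"}),              # t3 regressed
]
print(zero_regression_rate(tasks))  # 0.25
```

A score of 0.25 here means exactly what it means in the benchmark: three out of four patches left the suite worse than they found it.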
Only one model exceeded a 50% zero-regression rate: Claude Opus.
The researchers’ diagnosis is precise: “Agents are local optimizers. They don’t model how their change interacts with tests currently passing.” An agent asked to fix bug #47 will fix bug #47. It will not consider that its fix changes the behavior of functions used by tests #12, #31, and #55. Those tests pass today. They will fail tomorrow.
Why One-Shot Benchmarks Miss This
SWE-bench evaluates isolated fixes. An agent gets a bug report, a codebase snapshot, and a task: produce a patch. If the patch resolves the issue, it passes. The benchmark never asks what happens next.
In The AI Verification Debt, we documented the structural mismatch between how fast organizations deploy AI-generated code and how poorly they verify it. SWE-CI adds a temporal dimension to that argument. Verification debt is not just about the code you shipped today. It includes the regressions that code will produce next week, next month, and six months from now when someone refactors a related module.
The SWE-CI authors put it simply: “Real codebases evolve. A benchmark that evaluates isolated, one-shot fixes can’t see any of that.”
This is the difference between a doctor who treats a symptom and one who monitors the patient. The symptom might resolve. The treatment might create a new condition. You cannot evaluate medical care by checking whether the patient felt better on Tuesday.
The 20,171x Number
A separate data point sharpens the argument. Hōrōshi, writing on katanaquant.com (March 2026), documented a Rust rewrite of SQLite. The project produced 576,000 lines of code. On primary key lookups, the rewrite was 20,171 times slower than the original C implementation. A query that took 0.09 milliseconds in C took 1,815 milliseconds in Rust.
The bug was specific and instructive. The query planner missed the is_ipk flag, forcing full table scans on operations that should have been direct lookups. INSERT operations without explicit transactions ran 1,857x slower. UPDATE and DELETE operations exceeded 2,800x slowdowns.
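This class of bug is visible with SQLite's own tooling: EXPLAIN QUERY PLAN reports whether a query uses the integer primary key or falls back to a table scan. A minimal sketch using Python's stdlib sqlite3 (which runs the C SQLite, where the flag is handled correctly; the Rust rewrite itself is not reproduced here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"u{i}") for i in range(1000)])

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); detail is index 3.
pk = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 500").fetchone()[3]
scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE name = 'u500'").fetchone()[3]

print(pk)    # e.g. "SEARCH users USING INTEGER PRIMARY KEY (rowid=?)"
print(scan)  # e.g. "SCAN users" -- the full scan the missed is_ipk flag forces
```

A rewrite that misses the flag turns every query into the second plan. The results stay correct, which is exactly why the test suite never noticed.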
The code compiled. It passed its tests. It produced correct results. It was, by every surface metric, functional software. It was also unusable.
In Cheap Code, Expensive Quality, we established that the cost of producing code has collapsed while the cost of verifying code quality has not. The SQLite rewrite is the extreme case of that asymmetry. Producing 576,000 lines is now cheap. Detecting that a single missed flag causes a five-order-of-magnitude performance regression requires the kind of deep system knowledge that no current agent possesses.
Hōrōshi’s observation cuts to the core: “LLMs optimize for plausibility over correctness. The code compiles. It passes all its tests. But it is not correct.”
Plausible Code as a Governance Problem
The word “plausible” does the heavy lifting here. LLMs generate code that looks correct. It follows patterns. It uses appropriate APIs. It handles edge cases that appear in training data. The output passes the eye test and often passes the test suite.
But plausibility is not correctness. A plausible query planner routes queries. A correct query planner routes queries through the right index. The difference is invisible until you measure performance under load, and by then the plausible version is in production.
DORA’s 2024 research quantified this at scale: every 25% increase in AI adoption correlates with a 7.2% decrease in delivery stability. Mercury’s benchmark data shows roughly 65% correctness for AI-generated code, dropping below 50% when efficiency is a requirement.
These numbers describe the same phenomenon from different angles. AI agents produce code that works in isolation. That code degrades system-level properties (performance, stability, maintainability) because the agent has no model of the system. It only sees the function it was asked to modify.
Clay Bugattis
Marius Horatau, writing on uphack.io (March 2026), introduced an image that sticks: “clay Bugattis.” AI makes it dramatically cheaper to produce software that appears to work. The exterior is indistinguishable from a real product. The interior cannot survive first contact with real traffic, real data, or real time.
His framing clarifies the maintenance crisis. “Writing code is creating order. Software engineering is fighting entropy.” An agent can create order in one function, one file, one patch. It cannot fight the entropy that its own changes introduce into the broader system.
Consider Google Search. Two simple input pages. Tens of thousands of engineers. The complexity is not in the interface. It is in the system that makes the interface possible. A clay Bugatti version of Google Search would be trivially easy to build. It would return results. The results would be terrible. The difference between the clay version and the real version is fifteen years of entropy management.
As we examined in The Phase Shift in Software Engineering, the bottleneck has moved from writing code to directing and reviewing it. SWE-CI reveals a third bottleneck that sits downstream of both: maintaining code that was written by an agent that will not be present for the next seventy commits.
The Maintenance Asymmetry
Human developers who write code carry context forward. They remember why they made a design choice. They recognize when a new requirement conflicts with an old assumption. They refactor proactively because they feel the weight of accumulated decisions.
Agents have no memory between invocations. Each fix is a fresh start. The agent that patches bug #47 today has no knowledge of bugs #1 through #46, no awareness of the architectural constraints that shaped the original design, and no ability to anticipate bugs #48 through #100 that its change might trigger.
This creates an asymmetry that compounds. Every agent-generated fix is locally optimal and globally uninformed. Over 71 commits and 233 days, the SWE-CI data shows this accumulating into a regression rate that would be unacceptable in any serious engineering organization.
The 75%+ regression rate is not purely a model quality problem. Better models will reduce it; Claude Opus already demonstrates that. But the structural issue remains: an agent that cannot model system-level consequences will always introduce regressions at a rate proportional to system complexity. Better models will lower the constant. They will not eliminate the function.
What This Means for Governance
Organizations using AI agents for bug fixes face a choice they may not realize they are making.
Option one: treat every agent-generated patch as a draft that requires human review for system-level consequences. This works. It is also expensive enough to erode most of the productivity gains that justified adopting agents in the first place.
Option two: build regression detection infrastructure that catches the failures agents introduce. Continuous integration exists. But CI only catches regressions in the tests you have. The SWE-CI data shows agents break tests that exist. In production, many system-level properties (performance, memory usage, latency distributions) are not covered by tests at all.
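One form such infrastructure can take is a latency guard in CI. This is a minimal sketch with invented names and an invented 5x budget; the toy lookup paths (an index lookup silently replaced by a linear scan) mirror the is_ipk failure mode:

```python
import time

rows = [(i, f"row-{i}") for i in range(50_000)]
index = dict(rows)

def lookup_indexed(key):
    return index[key]            # direct lookup, like a primary-key index

def lookup_scan(key):
    for k, v in rows:            # full scan, like the missed is_ipk flag
        if k == key:
            return v

def median_latency(fn, key, runs=101):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(key)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[runs // 2]

baseline = median_latency(lookup_indexed, 49_999)
candidate = median_latency(lookup_scan, 49_999)
slowdown = candidate / max(baseline, 1e-9)

# The guard: fail the build when a patch blows the latency budget.
if slowdown > 5:
    print(f"perf regression: patched path is {slowdown:.0f}x slower")
```

The guard catches what the correctness suite cannot: both paths return the same rows, but only one of them survives contact with a large table.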
Option three: accept the regression rate and plan for it. Budget engineering time for fixing the fixes. This is honest but self-defeating. You adopted agents to reduce engineering burden. If agents generate maintenance work proportional to their output, the net productivity gain approaches zero.
None of these options is comfortable. The discomfort is the signal.
Hōrōshi makes the sharpest version of this point: “LLMs are dangerous to people least equipped to verify their output.” The organizations most eager to adopt AI agents for maintenance are precisely those with the thinnest senior engineering capacity to catch what agents get wrong.
The Bar for Longitudinal Benchmarks
SWE-CI is the first benchmark to measure what matters for production use: not whether an agent can fix a bug, but whether that fix holds up over time. The answer, for most current models, is that it does not.
This does not mean agents are useless for maintenance. It means the industry has been measuring the wrong thing. A benchmark that tests one-shot fixes tells you about capability. A benchmark that tests fixes over 233 days tells you about reliability. Capability without reliability is a governance problem, not a technology problem.
The data is clear. The question is whether organizations will update their adoption models to reflect it, or continue treating one-shot benchmark scores as evidence of production readiness.
Software that works today and breaks tomorrow is not working software. It is deferred failure with good marketing.
This analysis synthesizes SWE-CI: The Continuous Integration Benchmark for AI Coding Agents (Chen et al., March 2026), LLM Plausible Code (Hōrōshi, March 2026), and The Illusion of Building (Horatau, March 2026).
Victorino Group builds the governance infrastructure that catches what agents miss before it reaches production. Let’s talk.