The AI Control Problem

Cheap Code, Expensive Quality

Thiago Victorino

Simon Willison published the first chapter of his Agentic Engineering Patterns guide on February 23, 2026. The title says everything that needs saying: “Writing code is cheap now.”

The co-creator of Django is not making a prediction. He is describing an accomplished fact. Coding agents have, in his words, “dramatically dropped the cost of typing code into the computer.” A few hundred lines of clean, tested code used to cost a developer a full working day. Now it costs a prompt and ten minutes of waiting.

But Willison immediately follows with the sentence that matters more: “Delivering new code has dropped in price to almost free… but delivering good code remains significantly more expensive than that.”

This is the most precise economic framing of the current moment in AI-assisted development. And it leads directly to a governance problem that most organizations have not yet recognized.

What “Good” Actually Costs

Willison defines good code through a list that functions as an inadvertent audit of everything AI cannot cheaply verify:

  • Functional correctness without bugs
  • Verified confirmation of fitness for purpose
  • Solves the right problem
  • Graceful error handling with informative messages
  • Simple and minimal
  • Comprehensive test coverage
  • Current, accurate documentation
  • Future-change-friendly while respecting YAGNI
  • Non-functional qualities: accessibility, testability, reliability, security, maintainability, observability, scalability, usability

Read that list again. Every item requires human judgment. “Solves the right problem” requires understanding what the right problem is. “Future-change-friendly” requires knowing what changes are coming. “Simple and minimal” requires recognizing when code is doing too much — which requires understanding the system it lives in.

A coding agent can generate code that passes tests. It cannot determine whether the tests test the right things. It can produce documentation. It cannot assess whether the documentation matches what the code actually does in production. It can implement error handling. It cannot judge whether the error messages make sense to the humans who will read them at 3 AM during an incident.

The gap between “produces code” and “delivers good code” is not a tooling gap. It is a judgment gap. And judgment is the one thing that did not get cheaper.

The Individual Trap

Willison’s practical recommendation is characteristically pragmatic: “Any time our instinct says ‘don’t build that, it’s not worth the time’ fire off a prompt anyway, in an asynchronous agent session where the worst that can happen is you check ten minutes later and find that it wasn’t worth the tokens.”

For an individual developer — especially one as experienced as Willison — this works. You generate cheaply, evaluate with expertise, discard what doesn’t work. The cost of exploration dropped to nearly zero. The cost of evaluation stays constant but is amortized against more options.

The problem appears at organizational scale.

When every developer in an organization can generate code at near-zero cost, the volume of code requiring evaluation multiplies. Review queues grow. Merge conflicts increase. Integration testing becomes the bottleneck. The organization has made the cheap part cheaper and congested the expensive part.

LinearB’s analysis of 8.1 million pull requests found that AI-generated PRs had a 32.7% acceptance rate versus 84.4% for manually written ones. This is the quality gap expressed in operational data. The code is being generated. It is not being accepted. The bottleneck moved exactly where Willison’s framework predicts it would — to verification.
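The arithmetic behind that shift is worth making explicit. Using LinearB's reported acceptance rates (32.7% for AI-generated PRs, 84.4% for manual ones), a rough expected-value model shows how many review cycles it takes to land one accepted PR under each regime. The model itself is a back-of-envelope sketch, not something from LinearB's report:

```python
# Back-of-envelope model of the verification bottleneck, using the
# acceptance rates reported by LinearB (32.7% AI vs 84.4% manual).
# The model is an illustrative simplification: it treats each review
# cycle as an independent attempt with a fixed acceptance probability.

def reviews_per_accepted_pr(acceptance_rate: float) -> float:
    """Expected number of review cycles needed to land one accepted PR."""
    return 1 / acceptance_rate

manual = reviews_per_accepted_pr(0.844)  # ~1.18 reviews per accepted PR
ai = reviews_per_accepted_pr(0.327)      # ~3.06 reviews per accepted PR

print(f"manual: {manual:.2f} reviews per accepted PR")
print(f"AI:     {ai:.2f} reviews per accepted PR")
print(f"review load multiplier: {ai / manual:.1f}x")
```

Even before accounting for higher generation volume, each accepted AI PR consumes roughly two and a half times the review effort of a manual one. Multiply that by developers generating far more PRs, and the verification queue is where the cost lands.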

The Missing Governor

Here is the insight Willison’s framing enables but doesn’t explicitly state: when code was expensive to produce, the expense itself functioned as a natural governor.

You thought before you coded because coding was time-consuming. You designed before you implemented because implementation was the scarce resource. You didn’t build unnecessary features because building was costly. The cost of production imposed discipline.

That governor is gone.

When production is nearly free, every impulse becomes a pull request. Every idea becomes a prototype. Every side thought becomes a branch. The codebase expands not because the organization needs more code, but because producing more code has no natural friction.

Sonar’s 2026 survey found 96% of developers don’t trust AI code accuracy. This is not irrational skepticism — it is the rational response to a system that generates artifacts faster than anyone can verify them. The developers are telling you: the governor is gone, and nothing has replaced it.

What Replaces the Governor

Willison draws the line between “agentic engineering” — professionals using agents with quality standards — and “vibe coding” — generating code with no attention to quality. The distinction is the right one, but he frames it as an individual choice.

At the organizational level, the distinction is governance.

The organizations that will succeed with AI-assisted development are not the ones that generate the most code. They are the ones that build verification infrastructure proportional to their generation capacity. This means:

Quality gates that scale with volume. If your developers can generate 10x more code, your review process needs to handle 10x more evaluations — either through automation (type systems, static analysis, architectural fitness functions) or through smarter triage (risk-tiered review, specification-driven acceptance).
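Risk-tiered review, in particular, is easy to sketch. The tiers, risk signals, and thresholds below are illustrative assumptions, not a standard — the point is that routing each PR to the cheapest review path its risk allows is what lets human attention scale sublinearly with generation volume:

```python
# A minimal sketch of risk-tiered review triage. The tier names, risk
# signals, and the 50-line threshold are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class PullRequest:
    lines_changed: int
    touches_auth_or_payments: bool
    has_passing_tests: bool


def review_tier(pr: PullRequest) -> str:
    """Route a PR to the cheapest review path its risk profile allows."""
    if pr.touches_auth_or_payments:
        return "senior-human-review"      # high risk: always a human
    if not pr.has_passing_tests:
        return "reject-until-tests-pass"  # automated gate, zero human cost
    if pr.lines_changed <= 50:
        return "automated-checks-only"    # low risk: static analysis suffices
    return "standard-human-review"


print(review_tier(PullRequest(10, False, True)))
```

The design choice that matters: the automated tiers absorb volume, so the human tiers stay reserved for the judgment calls the “good code” checklist actually requires.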

Evaluation as a first-class discipline. Not an afterthought. Not “someone will review it.” A systematic practice with staffing, tooling, and metrics. Willison’s “good code” checklist is the specification. The organization needs the operational infrastructure to verify against it.

Cost accounting that includes verification. The token cost of generating code is nearly zero. The organizational cost of evaluating, integrating, maintaining, and eventually deprecating that code is not. Any AI productivity calculation that measures output without measuring verification overhead is lying to itself.
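What such an accounting might look like, in sketch form. Every dollar figure here is a made-up assumption for illustration; only the 32.7% acceptance rate comes from the LinearB data cited above:

```python
# A sketch of cost accounting that includes verification. Rejected
# attempts still pay generation and review; only the accepted attempt
# pays integration. All dollar figures are illustrative assumptions.

def true_cost_per_accepted_pr(generation_cost: float,
                              review_cost: float,
                              acceptance_rate: float,
                              integration_cost: float) -> float:
    """Expected total cost to land one accepted PR."""
    expected_attempts = 1 / acceptance_rate
    return expected_attempts * (generation_cost + review_cost) + integration_cost


# Assumed numbers: $0.50 of tokens, $120 of reviewer time per cycle,
# 32.7% acceptance (LinearB's AI figure), $200 integration/maintenance.
cost = true_cost_per_accepted_pr(0.50, 120, 0.327, 200)
print(f"${cost:.2f} per accepted PR")  # reviewer time dominates token cost
```

The token line item is noise; the reviewer line item, multiplied by the number of attempts, is the real bill. A productivity dashboard that shows only generation volume is reporting the $0.50 and hiding the rest.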

Willison is right that code is cheap. The organizations that treat quality as equally cheap will learn the difference at production time.


This analysis is based on Simon Willison’s “Writing code is cheap now” (February 23, 2026), the opening chapter of his Agentic Engineering Patterns guide, with supporting data from LinearB’s 2026 engineering benchmarks and Sonar’s State of Code 2026 survey.

Victorino Group helps engineering teams build the verification infrastructure that makes cheap code trustworthy. The problem is not that AI writes too much code. The problem is that your organization has no system to determine how much of it is good. Let’s fix that.

If this resonates, let’s talk

We help companies implement AI without losing control.

Schedule a Conversation