
Your Codebase Already Has an AI Governance Layer. You Just Don't Know It.

Thiago Victorino

A recent Stack Overflow interview with Factory’s Eno Reyes surfaced an insight that the interview itself did not fully develop: the infrastructure your codebase already has --- or lacks --- determines whether AI agents accelerate your organization or amplify its technical debt.

This is not about buying new tools. It is about recognizing what your existing engineering practices actually are.

The Stanford Data

A Stanford study led by Yegor Denisov-Blanch, covering over 100,000 developers across 600+ companies, produced the clearest empirical finding on AI-assisted development to date: codebase quality is the strongest predictor of AI-driven productivity gains.

The numbers are more nuanced than the headline. AI tools boost simple greenfield tasks by 30–40%. For complex work in mature codebases, the gains drop to 0–10%. And here is the number that should concern every engineering leader: code maintainability declined 9% with AI usage.

That last figure is easy to gloss over. It shouldn’t be. It describes a feedback loop.

The Quality Degradation Spiral

This is the non-obvious insight that most commentary on the Stanford data misses.

If AI usage degrades code maintainability by 9%, and lower code quality reduces AI effectiveness, you have a compounding problem. Each cycle of AI-assisted development makes the next cycle slightly less productive. The degradation is small per iteration --- small enough to ignore in any given sprint --- but it compounds.

Imagine a codebase where AI tools are deployed without strong quality gates. The first quarter, AI-generated code ships faster but with slightly lower maintainability. The second quarter, the AI tools working on this now-slightly-degraded codebase produce marginally worse output. The third quarter, the degradation is noticeable enough that developers start complaining about code quality, but the AI productivity numbers still look good because nobody is measuring quality deterioration against the productivity baseline.

By the fourth quarter, you have a codebase that is measurably harder to work with --- for humans and AI alike --- and the productivity gains that justified the tooling investment have eroded.

This is not hypothetical. The Stanford data gives us the rate: 9% maintainability decline. The compounding math is straightforward. Organizations that deploy AI development tools without maintaining quality infrastructure are not staying in place. They are moving backward.
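To make the compounding concrete, here is a toy model, not the Stanford study's methodology. It assumes, purely for illustration, that maintainability erodes 9% per period and that the effective AI productivity gain scales linearly with maintainability (the 20% starting gain is an assumed figure):

```python
def spiral(periods, decline=0.09, initial_gain=0.20):
    """Return (maintainability, effective AI gain) after each period.

    Toy model: maintainability compounds downward at `decline` per period,
    and the AI gain is assumed to scale linearly with maintainability.
    """
    m, out = 1.0, []
    for _ in range(periods):
        m *= (1 - decline)                  # quality erodes each cycle
        out.append((m, initial_gain * m))   # gain shrinks with quality
    return out

for quarter, (m, gain) in enumerate(spiral(4), start=1):
    print(f"Q{quarter}: maintainability {m:.3f}, effective AI gain {gain:.1%}")
```

Under these assumptions, maintainability is down to roughly 69% of baseline after four quarters, and the productivity gain that justified the tooling has shrunk by the same factor.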

Gartner’s prediction that 40%+ of agentic AI projects will be scrapped by 2027 starts to look less like caution and more like arithmetic.

The Perception Gap

Before assuming that individual developers will self-correct this problem, consider the METR study from July 2025.

In a controlled trial with experienced open-source developers, those using AI tools were 19% slower on tasks --- while believing AI had sped them up by 20%. The perception-reality gap was nearly 40 percentage points.

This is not a commentary on the tools. It is a commentary on the reliability of subjective productivity assessment. If developers cannot accurately perceive whether AI is helping or hurting their individual productivity, they certainly cannot perceive a 9% decline in the maintainability of their collective output.

The quality degradation spiral is invisible to the people inside it. Which means it will not be caught by intuition, stand-ups, or retrospectives. It will only be caught by measurement --- by the same kind of automated, objective quality signals that engineers have been building for decades.

The Governance Layer You Already Have

Here is where the conversation gets interesting.

Reyes, in the Stack Overflow interview, describes what Factory calls “validation signals” --- the automated checks that determine whether agent-generated code meets quality standards. He lists them: compilation, linting, passing tests, type checking, security scanning, documentation coverage, code complexity analysis.

Read that list again. It is not a list of new AI governance tools. It is a list of engineering practices that predate large language models by decades.

Linters enforce style and catch mechanical errors. Type checkers ensure structural correctness. Test suites verify behavior. Security scanners detect vulnerability patterns. Code complexity analyzers flag maintainability risks. Every item on that list exists in most mature engineering organizations already.

The reframe is this: organizations that invested in engineering discipline --- real investment, not just tooling adoption but cultural enforcement of testing, typing, linting, and review standards --- were building AI governance infrastructure before AI existed. They just didn’t know it.

And organizations that treated these practices as overhead, as nice-to-haves that slow down delivery? They are now discovering that AI exposes every shortcut they ever took.

A codebase with 90% test coverage, strict type checking, and enforced linting is ready for AI agents today. Not because someone designed it for AI. Because quality infrastructure is governance infrastructure. They were always the same thing.

The Interns Analogy, Examined

Reyes uses a vivid metaphor: deploying AI agents is not like hiring another engineer. It is like hiring a hundred intern-level engineers. You cannot code-review a hundred interns manually.

The analogy is useful but it smuggles a premise. It frames the problem as one of review throughput --- too many outputs, not enough reviewers --- which conveniently positions the solution as automated review tooling. Which is what Factory sells.

There is a different way to read the same analogy. If you have a hundred interns, the question is not “how do I review their output fast enough?” The question is “what environment do I put them in so that bad output is structurally impossible?”

You don’t review a hundred interns line by line. You give them a codebase with strict type safety so the compiler catches their mistakes. You give them a CI pipeline with comprehensive tests so broken behavior never reaches main. You give them linting rules so style arguments are settled by configuration, not discussion. You give them templates and conventions so the shape of correct code is obvious.

The governance is in the environment, not in the review step. The interns succeed not because someone reviews everything they write, but because the system they write within constrains their output toward correctness.

This is a more useful framing for AI agents than the review-throughput model. It shifts the investment from “buy tools that review AI output” to “strengthen the infrastructure that constrains AI output.”

Harness Engineering Is Not New

Reyes describes “harness engineering” --- the work of managing context windows, injecting environment information, handling tool calls --- as a distinct discipline that separates effective agent deployment from ineffective attempts.

The concept is real and important. But it is not Factory’s invention. Anthropic has published extensively on building effective agent harnesses. Aakash Gupta and Phil Schmid have written about it in different terms. The entire field of context engineering, which emerged in 2025, is essentially the same discipline applied more broadly.

What matters is not who coined the term but what it reveals: the hard part of using AI agents is not the model. It is the infrastructure around the model. The test suites that define success. The type systems that prevent structural errors. The linting rules that maintain consistency. The CI pipelines that enforce quality gates.

This is engineering infrastructure. It has been engineering infrastructure for thirty years. The fact that it now also serves as AI governance infrastructure is the insight --- and it belongs to no vendor.

What This Means Practically

Audit your quality infrastructure before deploying agents. If your test coverage is low, your type checking is optional, or your linting is unenforced, fix those things first. They are not prerequisites for AI productivity. They are AI governance. Without them, agents amplify debt.
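The audit can be enforced mechanically. A minimal sketch of a CI quality gate, with hypothetical metric names and threshold values; in practice you would feed it numbers from your own tools (coverage.py, mypy, radon, and the like) and exit non-zero on failure:

```python
def quality_gate(metrics, thresholds):
    """Return the list of failed checks; an empty list means the gate passes."""
    return [name for name, minimum in thresholds.items()
            if metrics.get(name, 0.0) < minimum]

# Illustrative values -- wire in real measurements from your toolchain.
metrics = {"test_coverage": 0.87, "typed_ratio": 0.95}
thresholds = {"test_coverage": 0.90, "typed_ratio": 0.90}

failures = quality_gate(metrics, thresholds)
if failures:
    # In CI, this is where you would fail the build (exit non-zero).
    print("Quality gate failed:", ", ".join(failures))
```

The point of the sketch is that the gate applies identically to human- and agent-authored changes: governance lives in the pipeline, not in who wrote the diff.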

Measure maintainability, not just velocity. The Stanford data shows that AI can increase output speed while degrading output quality. If you only measure speed, you will celebrate the acceleration while missing the spiral. Track code complexity, test coverage trends, and maintainability indices over time.
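Trend detection is the simple part. A sketch that fits a least-squares slope to periodic maintainability snapshots and flags decline; the snapshot numbers are invented, and you would substitute your own maintainability index or complexity scores:

```python
def trend_slope(values):
    """Least-squares slope of a metric series, in metric units per period."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Quarterly maintainability-index snapshots (illustrative numbers).
snapshots = [72.0, 70.1, 67.8, 65.9]
slope = trend_slope(snapshots)
if slope < 0:
    print(f"Maintainability falling {abs(slope):.2f} points/quarter: investigate")
```

A persistently negative slope across quarters is the spiral showing up in data long before it shows up in developer intuition.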

Treat the perception gap as a design constraint. Developers believe AI helps even when measured data says otherwise. This is not a criticism of developers. It is a fact about human cognition that your governance system must account for. Automated quality metrics are not optional supplements to developer judgment. They are necessary corrections to a known perceptual bias.

Reframe the investment. The conversation in most organizations is “which AI tools should we buy?” The better conversation is “how strong is the quality infrastructure that AI tools will operate within?” A $50,000 investment in test coverage, type safety, and CI pipeline rigor will produce better AI outcomes than a $500,000 investment in agent platforms deployed into a codebase with no quality gates.

Watch for the spiral. The 9% maintainability decline is a per-period rate. If you are deploying AI tools and not seeing quality infrastructure metrics hold steady or improve, you are in the spiral. The time to intervene is now, not when the degradation becomes visible to developers --- because by then the perception gap means it has been compounding for months.

The Uncomfortable Implication

The organizations best positioned for AI-assisted development are not the ones with the biggest AI budgets or the most advanced tooling. They are the ones that spent the last decade being disciplined about boring things: test coverage, type safety, linting enforcement, documentation standards, CI pipeline rigor.

Those organizations were not preparing for AI. They were practicing good engineering. It turns out those are the same thing.

And the organizations that cut corners --- that shipped without tests, that made type checking optional, that treated linting as a suggestion --- are now discovering that AI does not forgive technical debt. It compounds it. Every shortcut they took is now a vulnerability in their ability to use the most important productivity technology of this decade.

The governance layer for AI is not a new product category. It is the engineering discipline you either built or didn’t.


Sources

  • Yegor Denisov-Blanch et al. Stanford Study on AI-Assisted Software Development. Stanford University, 2025–2026. (100K+ developers, 600+ companies.)
  • METR. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” metr.org, July 2025.
  • Gartner. Prediction on agentic AI project failure rates. 2025.
  • Eno Reyes. “Code Smells for AI Agents.” Stack Overflow Blog (Q&A), February 4, 2026.

Victorino Group helps engineering organizations build the quality infrastructure that turns AI capability into sustained productivity --- not a degradation spiral. If your team is deploying AI development tools and wants to ensure the gains compound rather than erode, reach out.
