The AI Control Problem

When AI Builds Itself: The Governance Gap in AI-Assisted Development

Thiago Victorino

Cognition, the company behind the AI coding agent Devin, now uses Devin to build Devin. In a February 2026 post, Nader Dabit --- who joined Cognition one week earlier as a DevRel leader --- described how the company’s own engineers assign tasks to the agent, review its pull requests, and merge its code into the product that the agent itself runs on.

The recursion is interesting. But the recursion is not the story.

The story is what happens when code generation outpaces every organization’s capacity to verify what was generated. And that story is not about Cognition. It is about every engineering organization that has adopted AI-assisted development without rethinking how code gets reviewed, validated, and maintained.

The Bottleneck Has Moved

For decades, the bottleneck in software development was writing code. Design the feature, figure out the logic, type it out, test it, fix it, ship it. The constraint was the speed at which a human could translate intent into working software.

AI coding tools have effectively dissolved that constraint. GitHub’s Octoverse report shows that 41% of new code on the platform is now AI-assisted. Cognition claims 30% of its own codebase is Devin-generated. Microsoft reports a similar figure for Copilot contributions. Google says 25% of its new code comes from AI.

The code is getting written. The question is who reads it.

Faros AI analyzed data from 10,000 developers and found a 98% increase in pull request volume alongside a 91% increase in review time. That is not a rounding error. The volume of code waiting for human review has nearly doubled, and the time spent reviewing it has risen almost as fast. The review pipeline is buckling under the weight of AI output.

Werner Vogels, Amazon’s CTO, put it plainly at re:Invent 2025: “You will write less code. You will review more code.”

The market has noticed. Cursor acquired Graphite, a code review platform, for over $290 million in December 2025. That acquisition makes no sense if code review is a solved problem. It makes perfect sense if code review is the new critical path.

The Productivity Paradox

Here is the part that should concern anyone making investment decisions around AI development tools.

Atlassian’s 2025 developer survey found that 99% of developers using AI tools report saving more than 10 hours per week. That sounds remarkable. But the same report found no measurable decrease in overall workload. More code is being produced. More code requires review. More code introduces dependencies, edge cases, and maintenance obligations. The time saved on generation is consumed by the overhead of everything that comes after generation.

A study from METR (Model Evaluation and Threat Research) found something more striking: AI coding tools resulted in a 19% net slowdown in task completion, despite users believing they were 20% faster. The subjective experience of productivity diverged sharply from the objective measurement of it.

This is not an argument against AI coding tools. It is an argument against the assumption that faster code generation automatically produces faster delivery. Generation is one step in a pipeline. If every other step in the pipeline --- review, testing, integration, security analysis, maintenance --- is unchanged or degraded, then faster generation produces more work-in-progress, not more finished work.

CodeRabbit’s data indicates that AI-assisted code reviews surface 1.7 times more issues per review than purely human reviews. That is partly because AI can detect patterns humans miss, but it is also because AI-generated code contains patterns that require more scrutiny. Senior engineers report spending 4.3 minutes reviewing AI-generated code suggestions compared to 1.2 minutes for human-written code. The code looks clean. The logic requires deeper inspection.

The Junior Engineer Ceiling

Cognition’s own internal heuristic is revealing. Dabit describes the rule for assigning work to Devin: “If a junior engineer could figure it out with sufficient instructions, it’s a good Devin task.”

This maps directly to what Addy Osmani calls the “80% Problem” in agentic coding. AI agents handle roughly 80% of a task rapidly --- the structured, pattern-matching, well-documented parts. The remaining 20% --- the ambiguity, the judgment calls, the edge cases that require understanding context beyond the immediate code --- is where they stall.

The junior engineer heuristic is honest. It is also a ceiling. And it raises a question that the AI coding industry has not addressed: if AI permanently absorbs the work that junior engineers used to do, how does the next generation of senior engineers develop?

Junior engineering work is not just cheap labor. It is the training pipeline. It is where engineers develop the judgment, debugging intuition, and systemic understanding that makes them capable of the complex 20% that AI cannot handle. Entry-level software engineering hiring dropped 50% between 2023 and 2025. If that trend continues, organizations will face a skills gap not because AI replaced senior engineers, but because AI eliminated the path that produces them.

This is not a theoretical concern. It is a structural one. And it has no obvious solution within the current model of AI-assisted development.

The Dogfooding Illusion

Cognition using Devin to build Devin is presented as a trust signal. The logic: if the company is willing to stake its own product on the tool, the tool must be good enough for your organization too.

This argument has surface appeal but fails under scrutiny. It is survivorship bias in product marketing. We see the pull requests that were merged. We do not see the ones that were rejected, the ones that required extensive rework, or the ones that introduced subtle bugs caught only later. Cognition reports a 67% merge rate for Devin-generated PRs. That is positioned as a success. But a 33% rejection rate for an automated system pushing code into a production codebase is not negligible. One in three contributions fails review. At scale, that is an enormous amount of engineering time spent reviewing code that should not have been written.

Microsoft dogfoods Copilot. Google dogfoods its own tools. The pattern is real. But dogfooding is a product development practice, not evidence of production readiness for every context. The conditions inside these companies --- elite review infrastructure, extensive test suites, deeply experienced engineering teams --- are not the conditions inside most organizations deploying the same tools.

Kent Beck, in his September 2025 essay “Programming Deflation,” offered a more honest framing: code-writing is becoming a commodity. The value is migrating to understanding, integration, and judgment. The Sonar CEO echoed this: “Value is no longer defined by the speed of writing code, but by the confidence in deploying it.”

Confidence in deployment requires review infrastructure. And most organizations do not have it.

The Security Question Nobody Wants to Ask

AI-generated code introduces security risks at a rate that is difficult to dismiss. Veracode research found that AI-generated code contains security flaws in 45% of test cases. Fortune 50 enterprises have reported a tenfold increase in monthly security findings since adopting AI coding tools. Code cloning --- where AI reproduces known vulnerable patterns --- has surged fourfold.

Meanwhile, one of the promises of AI coding is democratization: non-engineers can now contribute code through natural language. Cognition highlights this as a feature. It is also an unresolved governance problem. When a product manager or designer pushes AI-generated code, who reviews it? Against what security standards? With what authority to merge or reject?

Traditional code review assumes the author has engineering context. AI-generated code from non-engineers breaks that assumption. The review burden shifts entirely to the reviewer, who must now evaluate not just the code’s correctness but the author’s understanding of what the code does. This is a qualitatively different --- and more demanding --- form of review.

Dark Reading has described this as a “security nightmare” in the making. That may be alarmist. But the governance gap is real. Organizations are enabling new categories of contributors without establishing new categories of oversight.

Developer Trust Is Declining While Usage Rises

There is one more data point worth sitting with. Developer trust in AI coding tools has dropped from 43% to 29% over the past 18 months, even as adoption has risen to 84%.

This divergence tells a clear story: developers use these tools because they produce output. They do not trust these tools because they have seen what that output costs downstream --- in review time, in debugging, in subtle regressions, in the slow accumulation of code that nobody fully understands.

Usage without trust is a fragile adoption pattern. It works as long as the perceived productivity gains outweigh the felt costs. The moment that equation shifts --- a major incident traced to unreviewed AI code, a security breach from a cloned vulnerability, a critical system failure in code nobody can explain --- the backlash will be proportional to the uncritical adoption that preceded it.

What Organizations Should Be Doing

The problem is not AI-generated code. The problem is the absence of governance infrastructure scaled to the volume and nature of AI-generated code. Five things matter.

Invest in review infrastructure before investing in generation tools. The bottleneck is review. Adding more generation capacity without addressing review capacity makes the problem worse. This means tooling (automated code analysis, AI-assisted review), process (clear review standards for AI-generated code), and people (sufficient senior engineering capacity to review at the volume AI produces).

Establish explicit policies for AI-generated contributions. Who can submit AI-generated code? What review standards apply? Are there categories of code --- security-critical paths, data-handling logic, authentication systems --- where AI-generated contributions are prohibited or require elevated review? These policies do not exist in most organizations. They should.
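Such a policy can be expressed as code rather than as a wiki page. Below is a minimal sketch of a policy gate that routes AI-generated pull requests to the right review tier. The path patterns, tier names, and the `ai_generated` flag are all illustrative assumptions, not taken from any real organization's policy or any specific platform's API.

```python
"""Sketch of a policy gate for AI-generated contributions.

Hypothetical example: path patterns, tier names, and the way
AI provenance is declared are illustrative placeholders.
"""
from dataclasses import dataclass
from fnmatch import fnmatch

# Paths where AI-generated changes are prohibited outright.
PROHIBITED = ["auth/*", "crypto/*"]
# Paths where they require elevated review (e.g. two senior reviewers).
ELEVATED = ["payments/*", "*/data_handling/*"]

@dataclass
class PullRequest:
    ai_generated: bool       # declared by the author or detected by tooling
    changed_files: list

def review_requirement(pr: PullRequest) -> str:
    """Return the review tier a PR must clear before merge."""
    if not pr.ai_generated:
        return "standard"
    if any(fnmatch(f, p) for f in pr.changed_files for p in PROHIBITED):
        return "blocked"     # AI contributions not allowed on these paths
    if any(fnmatch(f, p) for f in pr.changed_files for p in ELEVATED):
        return "elevated"    # e.g. senior review plus security sign-off
    return "standard"
```

A check like this can run in CI and fail the build, which makes the policy enforceable rather than advisory.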

Preserve the junior engineering pipeline. If AI absorbs all junior-level tasks, create deliberate learning pathways that expose developing engineers to the complexity, debugging, and system-level thinking that builds expertise. This is not sentimentality about “the way we used to do things.” It is a practical concern about maintaining the engineering workforce that AI tools depend on having available as reviewers.

Treat the productivity paradox as real until proven otherwise. Do not assume that faster generation means faster delivery. Measure end-to-end cycle time, not just code output. Track review queue depth, defect rates in AI-generated code, and time-to-resolution for AI-introduced bugs. If the data shows net productivity gains, continue. If it shows the paradox, restructure accordingly.
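Measuring end-to-end cycle time rather than generation speed can be as simple as summing phases per pull request and comparing medians by provenance. The record shapes and sample numbers below are hypothetical; adapt the field names to whatever your PR tracker exports.

```python
"""Sketch: measure end-to-end cycle time, not just generation speed.

Field names and sample records are hypothetical placeholders.
"""
from statistics import median

prs = [  # hours from first commit to merge, split by phase
    {"ai": True,  "coding_h": 1.0, "review_h": 9.0},
    {"ai": True,  "coding_h": 0.5, "review_h": 14.0},
    {"ai": False, "coding_h": 6.0, "review_h": 4.0},
]

def cycle_times(records, ai):
    """Total cycle time per PR, filtered by AI provenance."""
    return [r["coding_h"] + r["review_h"] for r in records if r["ai"] == ai]

ai_median = median(cycle_times(prs, True))
human_median = median(cycle_times(prs, False))
```

If the AI-assisted median is higher despite faster coding, the paradox is showing up in your own data, and the fix is in the review pipeline, not the generation tool.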

Close the feedback loop on AI-generated code in production. Track which code was AI-generated. Monitor its defect rate, its security incident rate, its maintenance cost relative to human-written code. Without this data, you are making governance decisions based on vendor marketing, not evidence.
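One lightweight way to close that loop is to record provenance at commit time (for example, via a commit trailer such as `AI-Assisted: true`) and join it against incident data. The trailer name, commit identifiers, and incident mapping below are illustrative assumptions, not a prescribed schema.

```python
"""Sketch: compare defect rates by code provenance.

Assumes commits carry a provenance marker (e.g. a commit trailer)
and that incidents are traced back to the commits that introduced
them. All identifiers here are illustrative.
"""
from collections import Counter

# sha -> provenance, e.g. parsed from a trailer like "AI-Assisted: true"
commits = {"a1": "ai", "b2": "human", "c3": "ai", "d4": "human"}
incident_commits = {"a1", "c3", "b2"}  # commits implicated in incidents

def defect_rate_by_provenance(commits, incidents):
    """Fraction of commits from each source later tied to an incident."""
    totals = Counter(commits.values())
    implicated = Counter(commits[sha] for sha in incidents if sha in commits)
    return {src: implicated[src] / totals[src] for src in totals}

rates = defect_rate_by_provenance(commits, incident_commits)
```

Even a coarse comparison like this replaces vendor marketing with first-party evidence about what AI-generated code actually costs in production.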

The Recursion That Matters

Cognition using Devin to build Devin is a compelling narrative. It demonstrates confidence. It generates headlines. It showcases the tool’s capabilities in a controlled, high-competence environment.

But the recursion that matters for the industry is different. It is the feedback loop --- or lack of one --- between code generation and code governance. As AI generates more code, organizations need more review capacity. As review becomes the bottleneck, organizations are tempted to use AI for review. As AI reviews AI-generated code, the human in the loop becomes thinner. As the human becomes thinner, the ability to catch the errors that AI systems share --- the systematic blind spots, the patterned vulnerabilities, the confident-but-wrong logic --- degrades.

This is not a problem that solves itself. It is a problem that compounds.

The organizations that navigate it successfully will not be the ones that adopt AI coding tools fastest. They will be the ones that build the governance, review, and verification infrastructure to match the pace of what those tools produce. The value was never in writing code. It was always in knowing whether the code should be deployed.


Sources

  • Nader Dabit. “How Cognition Uses Devin to Build Devin.” Substack, February 11, 2026.
  • Faros AI. Developer Productivity Report, 2025.
  • Werner Vogels. Keynote, AWS re:Invent 2025.
  • GitHub. Octoverse 2025 Report.
  • Atlassian. State of Developer Experience Report, 2025.
  • METR (Model Evaluation and Threat Research). “Do AI Coding Tools Speed Up Development?” 2025.
  • CodeRabbit. AI Code Review Benchmark Report, 2025.
  • Kent Beck. “Programming Deflation.” September 2025.
  • Addy Osmani. “The 80% Problem in Agentic Coding.” 2025.
  • Veracode. State of Software Security Report, 2025.
  • Dark Reading. “AI-Generated Code: A Security Nightmare in the Making.” 2025.
  • Cursor. Acquisition of Graphite (code review platform). December 2025.

Victorino Group helps organizations build the governance infrastructure that keeps human judgment at the center of AI-assisted development. If your code generation has outpaced your review capacity, let’s talk.

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation