The AI Control Problem

When AI Builds Its Own Tools: The Recursive Trust Problem

Thiago Victorino

In 1984, Ken Thompson received the Turing Award and used his acceptance lecture to destroy a comfortable assumption. The lecture was called “Reflections on Trusting Trust,” and its argument was deceptively simple: a compiler can be modified to inject backdoors into any program it compiles, including future versions of itself. Once compromised, the backdoor persists invisibly through every subsequent compilation. You cannot detect it by reading the source code. The source code is clean. The compiler is not.

Thompson’s conclusion: “You can’t trust code that you did not totally create yourself.”

Forty-two years later, OpenAI’s Codex team estimates that over 90% of the code in Codex is generated by Codex itself. Set aside whether that number is accurate (more on that below). Consider what it means structurally. An AI system is writing the code that defines its own behavior. And no human “totally created” it.

Thompson’s warning just became recursive.
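Thompson's attack is easier to grasp with a toy. The sketch below is entirely hypothetical and greatly simplified: the "compiler" is just a source-to-source function, and the backdoor password and program names are made up. A compromised compiler plants a backdoor whenever it compiles the login program, and re-plants its own injection logic whenever it compiles a compiler. The login source stays clean; the compiled artifact does not.

```python
# Toy illustration of the "trusting trust" attack (hypothetical).
BACKDOOR = "    if password == 'joshua': return True  # injected, not in source"

def compromised_compile(source: str) -> str:
    artifact = source  # a real compiler would emit machine code here
    if "def login(" in source:
        # target 1: the login program gets a hidden master password
        artifact = artifact.replace(
            "def login(password):",
            "def login(password):\n" + BACKDOOR)
    if "def compromised_compile(" in source:
        # target 2: recompiling the compiler re-inserts this logic, so the
        # attack survives even after this function's own source is cleaned
        artifact += "\n# ...injection logic re-inserted here..."
    return artifact

login_src = "def login(password):\n    return password == SECRET"
print(compromised_compile(login_src))
# the compiled artifact contains the backdoor; login_src never did
```

Reading the login source reveals nothing, which is exactly Thompson's point: the compromise lives one level down, in the tool that builds the tool.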

Why Compiler Bootstrapping Was Trustworthy

The instinct is to compare AI self-generation to compiler bootstrapping, a well-understood practice in computer science. GCC compiles itself. The Rust compiler compiles itself. This has worked reliably for decades. Why should AI self-generation be different?

Because compiler bootstrapping has three properties that AI generation does not.

First, it is deterministic. Given the same source code and the same compiler version, you get the same binary. Every time. You can verify this independently. Reproducible builds are a solved problem in compiler engineering.

Second, it is auditable. The source code is written by humans, reviewed by humans, and version-controlled. The compiler is a tool that transforms human-written source into machine code. The trust chain runs from human intent through a deterministic process to a verifiable output.

Third, it is transparent. If a compiler introduces a Thompson-style backdoor, the backdoor exists in the compiler binary. In principle, you can disassemble the binary and find it. In practice, this is difficult. But the attack surface is bounded and knowable.

LLM-based code generation has none of these properties. It is non-deterministic: the same prompt produces different code each time. It is not auditable in the traditional sense: the “source” is a neural network with billions of parameters, not a readable codebase. And its failure modes are not bounded. An LLM does not inject specific backdoors. It produces code that is statistically plausible but may be subtly wrong in ways that are difficult to enumerate in advance.

The trust chain that made bootstrapping trustworthy simply does not exist for AI self-generation.
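The determinism gap can be sketched in a few lines. Everything below is a toy stand-in: the "build" step hashes its input deterministically, the way a reproducible build yields a verifiable artifact, while the "generator" samples from a distribution, so identical prompts need not yield identical code.

```python
import hashlib
import random

def build(source: bytes) -> str:
    # deterministic: identical source always yields the identical digest,
    # which is what makes reproducible builds independently verifiable
    return hashlib.sha256(source).hexdigest()

def generate(prompt: str, seed: int) -> str:
    # stochastic: the output is sampled, not computed from the input alone
    rng = random.Random(seed)
    return rng.choice(["return x", "return y", "return x + 1"])

src = b"int main(void) { return 0; }"
assert build(src) == build(src)          # holds on every run, every machine
print(generate("implement f", seed=1))   # two samplings of the same prompt...
print(generate("implement f", seed=2))   # ...can produce different code
```

The `build` check is the verification step that bootstrapping relies on. There is no equivalent check you can run against `generate`.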

The Verification Arithmetic

If AI generates 90% of a codebase, humans need to verify 90% of the codebase. The question is whether they can.

The evidence says no.

Sonar’s 2026 State of Code survey asked 1,149 developers about their relationship with AI-generated code. The results: 96% do not fully trust AI output, and only 48% always verify before committing. Developers know the code is unreliable. They commit it anyway, because verification at scale is slow, expensive, and organizationally unsupported.

METR, a research organization, ran a randomized controlled trial measuring developer productivity with AI assistance. Senior engineers spent 4.3 minutes reviewing each AI suggestion, compared to 1.2 minutes for human-written code. The AI code required 3.6x more review effort per unit. Scale that to 90% of a codebase and the review burden becomes untenable.

LinearB’s 2026 Engineering Benchmarks quantified the downstream effect across 8.1 million pull requests from 4,800 organizations. AI-generated pull requests have a 32.7% acceptance rate. Human-written pull requests: 84.4%. AI PRs wait 4.6 times longer before anyone reviews them. Teams with high AI adoption merge 98% more PRs, but review time increases 91%.

Generation is outpacing review. This is a verification deficit, compounding with every commit.
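The arithmetic is easy to run against your own numbers. A minimal sketch using the METR review times cited above; the weekly change volume is a made-up example, not a measurement.

```python
AI_REVIEW_MIN = 4.3     # minutes per AI-generated change (METR, cited above)
HUMAN_REVIEW_MIN = 1.2  # minutes per human-written change

def weekly_review_hours(changes_per_week: int, ai_share: float) -> float:
    ai = changes_per_week * ai_share
    human = changes_per_week - ai
    return (ai * AI_REVIEW_MIN + human * HUMAN_REVIEW_MIN) / 60

# a hypothetical team shipping 400 changes a week:
print(round(weekly_review_hours(400, 0.10), 1))  # 10.1 hours of review
print(round(weekly_review_hours(400, 0.90), 1))  # 26.6 hours of review
```

Moving the AI share from 10% to 90% roughly
 2.6x-es the review load at constant volume, and high-adoption teams do not hold volume constant.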

The 30/70 Trap

The Codex team describes their workflow as roughly 30% thinking, 70% AI execution. This ratio gets cited as a model for AI-native development. Think less, generate more.

The framing obscures a critical dependency. When AI handles 70% of execution, the quality of the entire output depends on the quality of the 30%. The thinking phase determines what gets built, how it gets structured, and what constraints the AI operates within. If the thinking is sloppy, the execution amplifies the sloppiness at scale.

But organizations do not measure the thinking phase. They measure the execution phase because it is visible. PR volume. Lines of code. Cycle time. Deployment frequency. These are all execution metrics. The thinking that produces good or bad execution is invisible to dashboards.

This creates a perverse incentive. Teams that spend more time thinking and less time generating will look less productive by every standard metric. Teams that generate rapidly with minimal forethought will look highly productive right up until the accumulated errors surface in production.

The 30/70 ratio is not inherently wrong. It becomes dangerous when organizations measure the 70 and ignore the 30.

Sandboxing Is Necessary but Not Sufficient

The Codex team runs AI-generated code in sandboxed environments without network access. This is good security practice. It prevents data exfiltration, unauthorized API calls, and network-based attacks.

It does not prevent logical errors. It does not detect architectural drift. It does not catch the accumulated complexity that emerges when parallel AI agents generate code without awareness of each other’s decisions. It does not address the subtle misalignment between what a developer intended and what the AI produced.

Sandboxing addresses a specific threat model: the AI as an adversary attempting to exfiltrate data or communicate externally. That threat model is real and worth mitigating. But it is a subset of the risks that AI self-generation introduces.

The space between “sandboxed” and “governed” is where organizational risk accumulates. A sandboxed AI that consistently produces subtly incorrect code poses no security threat. But it creates an engineering liability. And liabilities compound in ways that security controls cannot detect.

Tiered Review Is a Governance Decision, Not a Quality System

Some organizations respond to the verification burden by implementing tiered review: AI handles routine code review, humans review critical paths. This sounds reasonable. It is also an implicit concession that full verification of AI-generated code is impossible at current volumes.

That concession is probably correct. But it needs to be acknowledged as what it is: a risk stratification decision. The organization is deciding which code paths warrant human verification and which do not. That decision has consequences. The “non-critical” paths that receive only AI review will contain errors at whatever rate AI review misses them. Over time, those errors accumulate.

CodeRabbit and Ox Security found that AI-generated code produces 1.7 times more issues than human-written code. If tiered review routes 80% of that code to AI-only review, the organization is accepting a higher defect rate in 80% of its codebase as a deliberate trade-off.

This is acceptable governance if the organization acknowledges the trade-off, measures the defect rates, and adjusts the tiers based on evidence. It is poor governance if the organization presents tiered review as equivalent to full verification. The distinction matters because it determines whether leadership understands the actual risk posture.
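The trade-off can be made explicit with one line of arithmetic. The sketch below uses the 1.7x issue multiplier cited above; the baseline defect rate and the routing share are assumptions for illustration, not measurements.

```python
BASELINE = 10.0       # assumed human-code defects per 1,000 lines (illustrative)
AI_MULTIPLIER = 1.7   # CodeRabbit / Ox Security figure cited above

def blended_defect_rate(ai_only_share: float) -> float:
    # share of the codebase under AI-only review, carried at the AI defect
    # rate, blended with the human-reviewed remainder at the baseline
    return ai_only_share * BASELINE * AI_MULTIPLIER + (1 - ai_only_share) * BASELINE

print(round(blended_defect_rate(0.8), 1))  # 15.6: a 56% higher blended rate
```

An organization that writes this number down and tracks it is doing risk stratification. One that does not is doing the same thing without knowing it.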

The Historical Pattern

The concern here is not theoretical. Over-trust in automated systems has produced documented catastrophes.

The Therac-25 radiation therapy machine delivered massive radiation overdoses to six patients between 1985 and 1987, at least three of them fatal. The root cause was not a software bug per se. It was the removal of hardware safety interlocks because the development team trusted the software to handle safety. The software had a race condition. Without the hardware backup, the race condition was lethal.

Knight Capital lost $440 million in 45 minutes on August 1, 2012. An automated trading system deployed with a configuration error executed millions of unintended trades. Human operators could see the losses accumulating on screen but could not stop the system fast enough. The automation operated faster than human oversight could intervene.

Boeing’s MCAS system on the 737 MAX contributed to 346 deaths across two crashes. The system was designed to operate without pilot awareness under certain conditions. When the system received incorrect sensor data, it pushed the aircraft into repeated nosedives while the pilots fought controls they did not fully understand.

These are not analogies. They are the documented pattern: automated systems that operate beyond the boundary of human verification produce failures that human oversight cannot catch in time. The specific technology changes. The structural failure mode does not.

The Unverified Claim

A necessary caveat about the 90% figure. OpenAI’s team estimates that Codex generates over 90% of its own code. No methodology has been disclosed. No external audit has been conducted. The number is a team estimate, not a measured result.

For comparison, Redwood Research independently analyzed Anthropic’s similar claim about Claude’s codebase and found the actual figure for merged production code was closer to 50%. The difference between “90% of code touched by AI” and “90% of production code written by AI” is significant. The first includes suggestions that humans modified, code that was later rewritten, and experimental branches that were abandoned. The second is a much stronger claim.

Until OpenAI publishes its methodology, the 90% figure should be treated as directional, not precise. The governance implications hold at 50% too. The verification arithmetic is marginally better, but the structural problem is identical.

What This Requires

The recursive trust problem does not have a technical solution. Better AI will not fix it. The problem is the absence of organizational infrastructure to verify AI output at the rate AI produces it.

Four capabilities separate organizations that are managing this risk from those accumulating it.

Verification metrics that match generation metrics. If your dashboard shows PR volume and cycle time, it should also show acceptance rates for AI-generated code, defect rates by origin, and review coverage ratios. You cannot govern what you do not measure.

Explicit risk stratification. Decide which code paths require human review and which accept AI-only review. Document the decision. Measure the defect rates in each tier. Adjust based on evidence, not convenience.

Thinking-phase investment. Protect the 30% from organizational pressure to “just ship.” The engineers who spend more time on specifications, constraints, and architectural decisions before prompting the AI will produce better output. Their metrics will look worse. Reward them anyway.

Governance that scales with generation. When AI generates 50% of your code, your verification infrastructure needs to handle 50% of your code. When it reaches 90%, verification must scale proportionally. If adoption scales and governance stays flat, you are accumulating risk that no sandbox will contain.
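None of this requires exotic tooling. Here is a minimal sketch of origin-tagged verification metrics, under the assumption that each change record carries an origin field, an accepted flag, and a human-review flag; the field names and sample records are hypothetical.

```python
from collections import defaultdict

# hypothetical change records, tagged at commit time
changes = [
    {"origin": "ai",    "accepted": True,  "human_reviewed": False},
    {"origin": "ai",    "accepted": False, "human_reviewed": True},
    {"origin": "human", "accepted": True,  "human_reviewed": True},
]

def metrics_by_origin(records):
    buckets = defaultdict(list)
    for r in records:
        buckets[r["origin"]].append(r)
    return {
        origin: {
            # share of changes that were ultimately accepted
            "acceptance_rate": sum(r["accepted"] for r in rs) / len(rs),
            # share of changes a human actually reviewed
            "review_coverage": sum(r["human_reviewed"] for r in rs) / len(rs),
        }
        for origin, rs in buckets.items()
    }

print(metrics_by_origin(changes))
```

The hard part is not the computation. It is tagging origin honestly at commit time and putting the resulting numbers next to the velocity metrics leadership already watches.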

Thompson was right in 1984. You cannot trust code you did not totally create yourself. That was a philosophical observation when humans wrote all the code. It is an operational reality when AI writes most of it.

The fix is not better AI. It is better governance of the AI you already have.


Sources

  • Thompson, Ken. “Reflections on Trusting Trust.” Turing Award Lecture, 1984. ACM.
  • Sonar. “State of Code 2026.” 1,149 developers. 96% do not fully trust AI output. 48% always verify.
  • METR. 2025 randomized controlled trial. Senior engineers: 4.3 min per AI review vs. 1.2 min for human code.
  • LinearB. “2026 Engineering Benchmarks.” 8.1M pull requests, 4,800 teams. AI PR acceptance rate: 32.7% vs. manual: 84.4%.
  • CodeRabbit / Ox Security. AI-generated code produces 1.7x more issues than human-written code.
  • Stack Overflow. “2025 Developer Survey.” 49,000+ respondents. Trust in AI accuracy: 33%.
  • Veracode. 2025 analysis. 40-48% of AI-generated code contains security vulnerabilities.
  • Leveson, Nancy and Clark Turner. “An Investigation of the Therac-25 Accidents.” IEEE Computer, 1993.
  • SEC. Knight Capital Group LLC administrative proceeding. File No. 3-15570, 2013.

For the detailed technical breakdown of how Codex is built, including architecture, agent infrastructure, and sandboxing implementation, read Gergely Orosz’s coverage in The Pragmatic Engineer newsletter.


Victorino Group helps organizations build governance infrastructure for AI-generated code. When your AI writes its own tools, the question is no longer whether the code works. It is whether you can verify that it does. Reach out at contact@victorinollc.com or visit www.victorinollc.com.
