The AI Control Problem

The Verification Escalation: Why the Window for AI Safety Infrastructure Is Closing

Thiago Victorino

Since we published The Specification Problem Is the Governance Problem and You Are Not Killing Code Review, three independent signals have converged. Each one matters. Together, they form an argument that organizations can no longer defer.

The window for building verification infrastructure is closing. Not gradually. On a schedule.

Signal 1: The Volume Problem Has Numbers

Faros.ai studied 10,000+ developers across 1,255 engineering teams. The headline findings: teams with high AI adoption merged 98% more pull requests and completed 21% more tasks. Those numbers made their way into several breathless blog posts about AI productivity.

The number that did not make the rounds: review time increased 91%.

This is a capacity mismatch, not a productivity story. Engineers are generating code faster than organizations can verify it. The denominator (review capacity) has not grown. The numerator (code volume) has nearly doubled. That ratio breaks eventually, and “eventually” is closer than most engineering leaders assume.
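The arithmetic is simple enough to sketch. The model below is purely illustrative: it assumes a hypothetical team that merged 100 pull requests per week before AI adoption, with reviewers able to absorb a fixed 120 reviews per week. Only the 98% growth rate comes from the Faros.ai data; the baseline numbers are invented.

```python
# Illustrative back-of-envelope model. Only the +98% growth figure comes
# from the Faros.ai report; the baseline volume and review capacity are
# assumptions for a hypothetical team.
baseline_prs = 100       # PRs merged per week before AI adoption
review_capacity = 120    # PR reviews the team can complete per week (fixed)

prs_after_ai = baseline_prs * 1.98          # +98% merged PRs
weekly_backlog = prs_after_ai - review_capacity

print(f"PRs per week after AI adoption: {prs_after_ai:.0f}")
print(f"Unreviewed PRs accumulating per week: {weekly_backlog:.0f}")
```

With these (invented) numbers, 78 pull requests per week land without review. The specific figures don't matter; what matters is that any fixed review capacity below the new volume produces a backlog that grows linearly and without bound.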

A caveat worth stating plainly. Faros.ai’s data comes from engineering organizations mature enough to instrument their workflows. Correlation with AI adoption does not establish causation. Data-mature organizations differ from average ones in ways that confound simple comparisons. The 98% figure describes what happened in these specific teams, not what will happen in yours.

Still, the direction is clear. Google reports that 25% or more of its new code is AI-generated (Sundar Pichai, October 2025). Microsoft’s figure is around 30% (Satya Nadella, same period). These percentages will rise. Review capacity will not rise to match them. Something gives.

Signal 2: The Lean Creator Says Proofs Must Replace Review

Leonardo de Moura built Lean. In February 2026, he published an essay arguing that as AI writes more software, mathematical proof must replace traditional review. Not supplement it. Replace it.

We analyzed that essay in detail and found the advocacy problem: de Moura presents formal verification as if it were synonymous with Lean, omitting Coq, Isabelle, Dafny, F*, and TLA+. He has a career-long commitment to Lean’s adoption. His arguments should be weighed with that affiliation in mind.

What changed since our earlier analysis is that de Moura’s predictions started coming true faster than expected.

In early 2026, general-purpose AI (specifically Claude) formally verified zlib, the compression library used in roughly half the internet’s infrastructure. De Moura himself said this “was not expected to be possible yet.” The Lean mathematical library (Mathlib) now contains over 200,000 formally verified theorems, contributed by more than 750 people. AWS Cedar and Microsoft SymCrypt are verified in production using formal methods.

And then there is VeriBench. This benchmark measures how well frontier AI models handle formal verification tasks. The results are sobering: current models compile only 12.5% of verification benchmarks successfully. Agent-based architectures push that to roughly 60%. That is real progress and a real limitation, both at once. Formal verification is becoming practical for infrastructure components. It remains impractical for the business logic that constitutes most enterprise software.

De Moura made one observation in his essay that deserves more attention than it received: “If the same vendor provides both the AI and the verification, there is a conflict of interest.” This is the independence principle. The entity that generates code should not be the sole entity that verifies it. When the same model produces both the implementation and the test suite, failures become correlated. The verification loses its teeth.

Independent verification infrastructure is not optional. It is the structural requirement that makes everything else trustworthy.

Signal 3: The IPO Clock

Martin Dempsey, the former Chairman of the Joint Chiefs of Staff who now advises on AI governance, has been arguing that voluntary safety commitments from AI labs will not survive contact with public markets. His timeline: 6 to 18 months.

That timeline is informed speculation, not an empirical prediction. The underlying concern is legitimate.

Kalshi prediction markets put the probability of an Anthropic IPO in the second half of 2026 at 72%. OpenAI is reportedly targeting Q4 2026. When these companies go public, their voluntary safety commitments become subject to fiduciary duty. A commitment that reduces revenue becomes a liability that management must justify to shareholders.

Consider what Anthropic did in February 2026. Its Responsible Scaling Policy v3 replaced categorical pause triggers (if a model reaches capability threshold X, stop) with conditional ones requiring both “AI race leadership” AND “material catastrophic risk” before a pause activates. Holden Karnofsky, Anthropic’s Head of Policy, argued the old approach created perverse incentives to underreport capabilities. That argument has merit. The structural effect is still a loosening: the bar for pausing went up.

This happened before an IPO. After one, the pressure intensifies.

On February 28, 2026, 573 Google employees and 93 OpenAI employees signed “We Will Not Be Divided,” a letter opposing attempts to split the AI safety community from AI development advocacy. The letter exists because the split is already happening. Internal safety teams at frontier labs face growing tension between caution and commercial velocity. An IPO resolves that tension in favor of velocity, because public markets reward growth.

Post-IPO regulation may increase external pressure for safety. It may. Or it may arrive too late, too weak, or too focused on the wrong metrics. Building verification infrastructure before the incentive structure shifts is not paranoia. It is sequencing.

Where These Signals Converge

Each signal alone is manageable. Volume is growing. Verification is advancing. Commercial incentives are shifting. Organizations can file each one under “interesting trend” and move on.

The convergence is what changes the calculus.

Volume growth (Signal 1) means organizations need verification infrastructure now, not eventually. Formal verification advances (Signal 2) mean that infrastructure is becoming technically feasible for critical components. The IPO timeline (Signal 3) means the entities currently providing AI models will soon face structural incentives to prioritize speed over safety.

The implication: organizations that rely on AI vendors to self-police will discover, within 18 months, that the vendors’ incentive to police has weakened. The organizations that built independent verification infrastructure will have it. The ones that didn’t will be shopping for it under pressure, paying premium prices for scarce expertise.

What Verification Infrastructure Actually Looks Like

We are not proposing that every organization learn Lean. Formal verification works for a specific category of software: infrastructure with stable specifications where failure is catastrophic. Cryptographic libraries. Authorization engines. Compression algorithms. Parsers. This is the domain where zlib verification matters and where tools like Lean, Coq, and Dafny earn their complexity cost.

For business-critical systems with changing requirements, verification means something different: structured specification review, property-based testing, automated quality gates, and (critically) human architectural review. As we explored in our code review analysis, 75% of the value in review comes from evolvability and maintainability checks that no automated tool captures.
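To make the property-based testing piece concrete, a minimal sketch: instead of hand-picking test cases, generate random inputs and assert an invariant that must hold for all of them. The example below checks the classic round-trip property against Python's standard-library zlib binding; dedicated libraries such as Hypothesis automate the input generation and failure shrinking, but the core idea fits in a few lines.

```python
import random
import zlib

# Minimal property-based check (illustrative; tools like Hypothesis
# automate this): for randomly generated byte strings, compressing and
# then decompressing must return the original data exactly.
def check_roundtrip(trials: int = 200, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1024)))
        if zlib.decompress(zlib.compress(data)) != data:
            return False  # property violated
    return True

print(check_roundtrip())
```

A property like this verifies far less than a formal proof of the library, but it covers an input space no hand-written example suite can, and it expresses the requirement as a specification rather than a list of cases.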

For all systems, one principle holds. The specification is the governance artifact. Who writes it, who reviews it, who approves changes to it: these are governance decisions, not engineering decisions. De Moura is right that “specifications encode values.” That sentence is more important than anything about theorem provers.

CISQ estimates that poor software quality costs the US economy $2.41 trillion per year. That figure will grow as AI-generated code volume increases. The question is whether organizations invest in verification infrastructure proactively or pay for its absence in production failures, security breaches, and regulatory penalties.

The Independence Requirement

De Moura’s conflict-of-interest observation points to a structural requirement that most organizations have not internalized.

If your AI vendor generates the code, your AI vendor should not be the sole verifier of that code. If your verification tools come from the same company as your generation tools, the independence is cosmetic. A systematic flaw in the model will produce code that passes the model’s own checks.

This applies beyond formal verification. It applies to AI-assisted code review, AI-generated test suites, and AI-driven security scanning. Independence is not about using a different product. It is about ensuring that the verification pathway has no shared failure modes with the generation pathway.
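One way to operationalize the principle is as a merge gate that rejects changes whose every verification step shares a vendor with the code generator. The sketch below is hypothetical: the metadata schema is an assumption for illustration, not any real CI system's format.

```python
# Hypothetical merge-gate sketch. The "change" schema is an invented
# example, not a real CI system's format. The gate passes only when at
# least one verification step comes from a vendor other than the one
# that generated the code.
def has_independent_verification(change: dict) -> bool:
    generator_vendor = change["generated_by"]["vendor"]
    checks = change.get("verifications", [])
    return any(c["vendor"] != generator_vendor for c in checks)

change = {
    "generated_by": {"vendor": "vendor-a", "model": "model-x"},
    "verifications": [
        {"kind": "ai_review", "vendor": "vendor-a"},   # shares failure modes
        {"kind": "test_suite", "vendor": "vendor-a"},  # with the generator
    ],
}
print(has_independent_verification(change))  # gate fails: no independence
```

Vendor identity is a crude proxy; the deeper requirement is that no single systematic flaw can pass both the generation and verification pathways. But even this crude gate makes the correlated-failure case visible instead of silent.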

Verified components, once they exist, become permanent public goods. A formally verified cryptographic library cannot be degraded by a vendor update. Its guarantees cannot be quietly revoked. This is the asymmetry that makes investment in verification infrastructure rational even when the upfront cost is high: verified components compound in value because they do not decay.

The Specification Shift

There is a deeper change embedded in these signals that most commentary misses.

When formal verification becomes practical for infrastructure components, specifications become the source of truth. Code becomes an artifact. A replaceable artifact. You can regenerate code from a spec. You cannot regenerate a spec from code (not reliably, not yet, possibly not ever in a meaningful sense).

This inverts the traditional relationship between specification and implementation. For fifty years, specifications have been documents that describe what code should do, often written after the code exists, often out of date. In a verification-first world, specifications are the durable asset. Code is ephemeral.

Organizations that grasp this shift will invest in specification quality, specification review processes, and specification governance. Organizations that miss it will continue optimizing code generation speed while the thing that actually matters (what the code should do, for whom, under what constraints) remains underspecified and unreviewed.

What to Do Now

The window is 12 to 18 months. Not because Dempsey said so, but because the IPO timeline, the code volume trajectory, and the verification tooling maturity all converge in that range.

For engineering leaders: Audit your verification surface. What percentage of AI-generated code receives independent review? If a model generated the code and a model reviewed it, and both models share a vendor, that is not independent review. Build the separation now.

For security and compliance teams: Formal verification for security-critical components is no longer theoretical. AWS and Microsoft are already doing it. Evaluate where in your stack formal methods would pay for themselves. Start with cryptographic implementations and authorization logic.
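For authorization logic in particular, the input domain is often small and finite, which means the core idea can be approximated without a theorem prover: enumerate every case and check the property exhaustively. The policy and property below are invented for illustration; this is a lightweight stand-in for the symbolic guarantees that verified engines like Cedar provide over unbounded inputs.

```python
from itertools import product

# Illustrative sketch: the roles, actions, and policy are invented.
# Because the domain is finite, the safety property can be checked over
# every input rather than a sample.
ROLES = ("viewer", "editor", "admin")
ACTIONS = ("read", "write", "delete")

def allowed(role: str, action: str) -> bool:
    if role == "admin":
        return True
    if role == "editor":
        return action in ("read", "write")
    return action == "read"  # viewer

# Property: only admins may delete.
violations = [(r, a) for r, a in product(ROLES, ACTIONS)
              if a == "delete" and allowed(r, a) and r != "admin"]
print(violations)  # an empty list means the property holds everywhere
```

Exhaustive checking stops scaling the moment the input space grows beyond enumeration; that is exactly the boundary where formal methods start paying for their complexity.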

For executives: The cost of verification infrastructure is a fraction of the cost of verification failure. CISQ’s $2.41 trillion is the current cost, before AI-generated code volumes doubled. Your AI vendor’s safety commitments may not survive their IPO. Build your own verification capacity. Do not rent someone else’s.

For everyone: Treat specifications as governance artifacts. Review them with the same rigor you apply to policy documents. Because that is what they are.


This analysis synthesizes Faros.ai’s AI Productivity Paradox Report (2025), Leonardo de Moura’s “When AI Writes the World’s Software, Who Verifies It?” (February 2026), Anthropic’s RSP v3 Policy Update (February 2026), CISQ’s Cost of Poor Software Quality Report (2022), and the VeriBench Formal Verification Benchmark (2026).

Victorino Group helps organizations build independent verification infrastructure before the incentive window closes. Let’s talk.
