The AI Control Problem

The Specification Problem Is the Governance Problem

Thiago Victorino

Leonardo de Moura wants to prove your software correct. Not test it. Prove it. Mathematically, exhaustively, for every possible input, forever.

De Moura is the creator of Lean, a theorem prover that lets engineers write formal specifications and then verify that code satisfies them with the same certainty that 2+2=4. In late February 2026, he published an essay arguing that as AI generates more of the world’s software, formal verification becomes essential. Testing samples behavior. Proof covers all of it.

The argument is compelling. It is also incomplete in ways that reveal something more important than any single verification technique.

What Formal Verification Actually Does

The core idea is simple. You write a mathematical specification of what your software must do. Then you prove, using logic, that the implementation satisfies the specification. Not for some inputs. For all inputs.

Testing checks that your bridge holds under the loads you tested. Formal verification proves it holds under every load the physics allows. The difference is categorical.
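To make the idea concrete, here is a toy specification and proof, sketched in Lean 4 (assuming a recent toolchain with the `omega` tactic; the function and theorem names are illustrative, not from de Moura's essay):

```lean
-- `double` is the implementation; `double_spec` is the specification.
-- The proof covers every natural number, not a sampled subset.
def double (n : Nat) : Nat := 2 * n

theorem double_spec (n : Nat) : double n = n + n := by
  unfold double
  omega   -- decision procedure for linear arithmetic discharges the goal
```

A test suite could only check `double` on the inputs someone thought to write down. The theorem, once it compiles, holds for all of them.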

De Moura points to real production systems where this works. Microsoft’s SymCrypt cryptographic library uses Lean (via the Aeneas tool) to verify its implementations. The Veil framework verifies distributed protocol correctness. AWS Cedar, an authorization policy engine, uses formal methods to guarantee that access control decisions are correct.

At Google, 25% of new code is AI-generated. At Microsoft, the figure is similar. Veracode’s 2025 analysis found that 45% of AI-generated code fails security testing across 100+ LLMs. The case for stronger verification is obvious. As we documented in The AI Verification Debt, organizations deployed AI coding tools without building infrastructure to verify the output. De Moura’s essay proposes one path forward.

The Lean Advocacy Problem

There is a disclosure that de Moura’s essay does not make. He created Lean. He leads the Lean Focused Research Organization. His career, his team, and his research funding depend on Lean’s adoption.

This does not make his technical arguments wrong. It means you should notice what the essay leaves out.

De Moura presents formal verification as if it were synonymous with Lean. It is not. Coq, one of the oldest proof assistants, has verified a complete C compiler (CompCert) and the Four Color Theorem. Isabelle/HOL has verified the seL4 microkernel, arguably the most significant formal verification achievement in systems software. TLA+, created by Leslie Lamport, is widely used for verifying distributed systems at Amazon and Microsoft. Dafny, also from Microsoft Research, is what AWS actually used for Cedar’s verification. F* is used for verified cryptographic implementations in Project Everest.

None of these tools appear in de Moura’s essay. The omission is not accidental. An objective assessment of formal verification’s potential would survey the field. An advocacy piece for a specific tool omits the competition.

The Cedar example is particularly telling. De Moura’s essay implies Cedar validates his approach. But Cedar uses Dafny, not Lean. AWS’s own engineering team, at the same company that employs de Moura, chose a different tool for their flagship verified system. This is not a minor footnote. It undermines the essay’s central narrative.

The Anthropic Paradox

De Moura cites Anthropic’s verified Rust compiler as evidence that AI and formal verification work together. The compiler translates Anthropic’s domain-specific language to Rust and is, according to de Moura, verified against its specification.

Here is the part the essay glides past. How was the compiler validated in practice? By testing. Anthropic engineers ran test suites, checked outputs, and verified behavior empirically. The formal proof provides a mathematical guarantee about the specification. The confidence that the specification itself is right came from the same old-fashioned engineering process that de Moura’s essay argues we should move beyond.

This is not a contradiction. It is a clue about where the real problem lives.

Where Proof Works (and Where It Doesn’t)

De Moura’s “verified stack” roadmap is revealing in its specificity: cryptography, core libraries, storage engines, parsers, compilers. These domains share three properties that make formal verification practical.

First, they have tight mathematical specifications. A cryptographic function either produces the correct ciphertext or it doesn’t. A parser either accepts valid input and rejects invalid input, or it doesn’t. There is no ambiguity about what “correct” means.

Second, they change slowly. A verified cryptographic library might be updated once a year. The cost of writing and maintaining formal proofs is amortized over long periods of stability.

Third, failures are catastrophic and well-defined. A bug in a cryptographic library can compromise millions of systems. The cost of verification is justified by the cost of failure.

Now consider business software. A claims processing system, a recommendation engine, a supply chain optimizer. These systems have specifications that change quarterly, correctness criteria that depend on business context, and failure modes that are subtle rather than catastrophic. Formally verifying a function that calculates shipping costs requires a formal specification of your shipping cost rules, which change every time you renegotiate a carrier contract.

Simon Willison observed that code generation became cheap while quality remained expensive. Formal verification makes quality extremely expensive for a narrow category of software. For the vast middle ground where most organizations operate, the economics don’t close.

The Specification Trap

Researchers at the National University of Singapore, working with the Veil verification framework, discovered something de Moura acknowledges but underplays. When AI generates both code and specifications, the specifications can be “vacuously correct.” They prove something, but not the thing you care about.

A specification that says “this function returns a number” is formally verifiable and practically useless. The proof is perfect. The guarantee is empty.
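The trap is easy to reproduce. In this Lean 4 sketch (the names and the deliberately broken implementation are hypothetical), an obviously wrong function satisfies its specification, because the specification demands nothing:

```lean
-- A plainly wrong shipping-cost implementation: it charges nothing.
def shippingCost (weightKg : Nat) : Nat := 0

-- A "vacuously correct" specification: every `Nat` is ≥ 0,
-- so this theorem is provable for *any* implementation whatsoever.
theorem shippingCost_spec (w : Nat) : shippingCost w ≥ 0 :=
  Nat.zero_le _
```

The proof checker accepts this without complaint. Nothing in the formal machinery distinguishes a spec that constrains behavior from one that merely sounds like it does.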

This is the specification trap. Formal verification guarantees that code matches its spec. It says nothing about whether the spec captures what actually matters. And writing good specifications is at least as hard as writing good code. Possibly harder, because a specification must be simultaneously precise enough for mathematical proof and complete enough to capture real-world requirements.

The AI specification problem compounds this. If AI generates code that humans struggle to verify (the core finding from our verification debt analysis), what happens when AI generates specifications that humans struggle to evaluate? You get mathematically proven software that does the wrong thing. With a certificate of correctness.

Specifications Encode Values

De Moura writes one sentence that is more important than his entire technical argument: “Specifications encode values.”

Think about what this means in practice.

A medical device specification encodes decisions about acceptable risk thresholds for patients. A voting system specification encodes assumptions about what constitutes a valid ballot. An AI safety monitor specification encodes judgments about which model behaviors are acceptable and which are not.

These are not engineering decisions. They are governance decisions. They determine what “correct” means for systems that affect people’s lives, rights, and safety.

When de Moura proposes that we formally verify AI-generated code, he is proposing something more radical than he acknowledges. He is proposing that organizations write mathematical descriptions of their governance policies and then prove that their software conforms to them.

This is powerful. It is also the hardest part of the entire enterprise, and no theorem prover helps with it. Lean can verify that your code matches your spec. It cannot tell you whether your spec reflects the right values, considers the right stakeholders, or accounts for the right edge cases.

The specification problem is the governance problem. Who writes the spec? Who reviews it? Who decides what “correct” means when correctness is not a mathematical fact but a social and organizational choice?

What This Means for Organizations

De Moura is right that testing alone is insufficient for the volume of AI-generated code entering production. The verification deficit is real, it is compounding, and organizations that ignore it will pay in production failures.

But the solution is not a single technique. It is a verification portfolio matched to the risk profile of each system.

For cryptographic libraries and security-critical infrastructure: formal verification is appropriate and increasingly practical. Invest in it. The tools exist (Lean, Coq, Dafny, F*, Isabelle). The expertise is scarce but buildable.

For business-critical systems with stable specifications: property-based testing, static analysis, and architectural fitness functions provide strong guarantees at manageable cost. These are not inferior to formal proof. They are appropriate to the domain.
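What a property-based check looks like can be sketched with only the Python standard library; the function, the properties, and the trial count here are hypothetical examples, and a real project would reach for a framework such as Hypothesis rather than hand-rolling the generator:

```python
import random

def apply_discount(price_cents: int, percent: int) -> int:
    """Apply a percentage discount, rounding down to whole cents."""
    return price_cents - (price_cents * percent) // 100

def check_properties(trials: int = 1000) -> bool:
    """Check domain properties over randomly sampled inputs."""
    rng = random.Random(42)  # fixed seed for reproducibility
    for _ in range(trials):
        price = rng.randint(0, 1_000_000)
        pct = rng.randint(0, 100)
        result = apply_discount(price, pct)
        # Property 1: a discount never increases the price.
        assert result <= price
        # Property 2: the result is never negative.
        assert result >= 0
        # Property 3: a 0% discount changes nothing.
        assert apply_discount(price, 0) == price
    return True

print(check_properties())  # prints True
```

Unlike a formal proof, this samples the input space rather than covering it, but the properties are cheap to state, cheap to re-run when the business rules change, and they catch whole classes of bugs that example-based tests miss.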

For rapidly evolving business logic: specification-driven development (writing clear specs before generating code) combined with systematic review processes and automated quality gates. As we explored in How to Write Specs for AI Agents, the quality of your specification determines the quality of your AI-generated output.

For all systems: the governance question comes first. Before you verify code against a specification, you need to know that the specification reflects your organization’s actual requirements, risk tolerances, and values. That is not a technical problem. It is an organizational one.

The Bridge

De Moura’s essay is most valuable not as a case for Lean, but as an inadvertent case for treating specification as a governance discipline.

If writing specifications is writing governance policy, then specification review is governance review. Specification approval is governance approval. Specification maintenance is governance maintenance. The entire apparatus of organizational governance (stakeholder input, risk assessment, compliance review, change management) applies to specifications with the same force it applies to any other policy document.

Most organizations have not made this connection. They treat software specifications as technical artifacts written by engineers. The formal verification community’s insight, buried under layers of theorem-prover advocacy, is that specifications are policy artifacts that happen to be machine-readable.

The organizations that recognize this will build better software. Not because they adopted a particular proof assistant, but because they started treating the question “what should this software do?” with the same rigor they apply to the question “what should this organization do?”

That is the real verification infrastructure. Not the prover. The process that determines what gets proved.


This analysis synthesizes Leonardo de Moura’s “When AI Writes the World’s Software, Who Verifies It?” (February 2026), Veracode’s 2025 GenAI Code Security Report, and CISQ’s 2022 Cost of Poor Software Quality Report.

Victorino Group helps organizations build verification infrastructure that matches their AI ambitions. Let’s talk.
