Formal Verification Just Got Working Receipts. Two of Them. In One Week.

Thiago Victorino

For the past several months our spec-governance writing has been argument-heavy. We argued that specifications were the missing control layer. We argued that enterprise SDD adoption was outrunning its own governance. We brought back notes from Cloud Next on spec-driven development. We argued that agent specs themselves were governance artifacts. What we did not have was working code we could point at and say, “this is what the verification spine looks like when someone actually builds it.”

In one week we got two.

Antfly published a five-step workflow that uses AI agents to write TLA+ specifications for a production-grade key/value store, runs the model checker on those specifications, and surfaces a real concurrency bug. Reuben Brooks published Shen-Backpressure, a compiler that turns sequent-calculus type definitions into sealed guard types in Go, TypeScript, Python, and Rust, so that invalid agent code is refused at compile time. Different abstraction layers. Different languages. Same underlying thesis: when generation is cheap, verification has to be structural, not optional.

This piece is not a re-argument, because the pattern is no longer hypothetical. It is a walkthrough: here is what the two stacks look like, and here is what they tell us about where the verification spine is going.

What Antfly Actually Did

Rowan Copley’s “Cheap Code Means Formal Verification Is Reasonable Now” takes a workflow most engineering teams would file under “academic” and shows it running against Pebble, the key/value store underneath CockroachDB. The team picked a historical Pebble race condition as the benchmark and asked: can an agent-driven TLA+ workflow find this bug without prior knowledge of it?

The answer was yes, and the workflow that produced it has five steps:

  1. Write an assumptions.md and a boundaries.md that describe what the system is and what the verification is allowed to touch.
  2. Have the agent write TLA+ specifications, run the model checker, and report findings.
  3. Validate the findings against the actual source.
  4. Create unit tests that reproduce the bug in production code.
  5. Fix the bug and document the result for stakeholder personas.

The headline finding was the race condition. The quieter finding was the QPS optimization loop: the same workflow, pointed at a metric instead of a correctness property, hill-climbed performance by “orders of magnitude.” Formal verification, in this stack, is not just a bug finder. It is a search procedure over a defined state space, and the state space happens to include “fast” alongside “correct.”
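To make “a search procedure over a defined state space” concrete, here is a minimal sketch of explicit-state checking in Python. It is not TLA+, not the TLC model checker, and not the Pebble bug: the two-worker counter, the state layout, and every function name are assumptions chosen for illustration. The checker enumerates every interleaving breadth-first and returns the shortest trace that violates the invariant.

```python
from collections import deque

# Hypothetical toy system: two workers each perform a non-atomic increment
# (read the shared counter into a local, then write local + 1).
# State layout: (counter, pc_a, tmp_a, pc_b, tmp_b)
# pc values: 0 = about to read, 1 = about to write, 2 = done.
INITIAL = (0, 0, 0, 0, 0)


def successors(state):
    """Yield (label, next_state) for every step either worker can take."""
    counter, pc_a, tmp_a, pc_b, tmp_b = state
    if pc_a == 0:
        yield "A reads", (counter, 1, counter, pc_b, tmp_b)
    elif pc_a == 1:
        yield "A writes", (tmp_a + 1, 2, tmp_a, pc_b, tmp_b)
    if pc_b == 0:
        yield "B reads", (counter, pc_a, tmp_a, 1, counter)
    elif pc_b == 1:
        yield "B writes", (tmp_b + 1, pc_a, tmp_a, 2, tmp_b)


def invariant(state):
    """Once both workers are done, both increments must be visible."""
    counter, pc_a, _, pc_b, _ = state
    return not (pc_a == 2 and pc_b == 2) or counter == 2


def check(initial):
    """Breadth-first search over reachable states.

    Returns the shortest trace of step labels leading to an invariant
    violation, or None if every reachable state satisfies the invariant.
    """
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for label, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None


counterexample = check(INITIAL)
print(counterexample)
```

Running it prints `['A reads', 'B reads', 'A writes', 'B writes']`, the classic lost update. The verdict comes from exhaustive search, not from whoever wrote `successors`, which is the property the workflow relies on when an agent writes the spec.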

The cost story is the part that changes the conversation. TLA+ has existed for decades. Engineering teams have not adopted it broadly because the up-front investment to write the spec was larger than the expected value of finding the bug. When the spec writer is an agent operating from an assumptions.md file, the cost of formal verification collapses. The decision is no longer “is this bug worth a week of TLA+ work.” The decision is “is this subsystem worth running through the verification loop.” The answer becomes yes far more often.

What Reuben Brooks Actually Did

Reuben Brooks’s “Structural Backpressure Beats Smarter Agents” attacks the verification problem one layer lower. Where Antfly catches concurrency bugs through model checking, Brooks catches authorization bugs (and dozens of other state-shape errors) through types. The vehicle is Shen, a statically typed Lisp with sequent-calculus types, used to generate sealed guard types in target languages.

The example in the post is direct. An agent writes a Go function that uses a tenant ID without going through the authorization gate. The compiler refuses:

cannot use tenantID (variable of type string) as shenguard.TenantId

The agent did not need to be smarter. The compiler refused to accept the shape of the value the agent produced. Brooks’s framing is that the loop around the agent has five gates per iteration of his sb CLI:

  1. Specification generation through shengen.
  2. Tests.
  3. Compilation.
  4. Shen type-check.
  5. Audit scripts.
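The shape of such a loop can be sketched in a few lines of Python. This is not the sb CLI and does not call shengen. The two stand-in gates (a syntax check using Python's built-in `compile()` and a string-matching audit rule), the `GateResult` type, and the scripted agent are all hypothetical. What the sketch shows is the control flow: gates run in a fixed order, the first refusal is fed back to the agent, and nothing is accepted until every gate passes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class GateResult:
    gate: str
    passed: bool
    detail: str = ""


# A gate is a deterministic function of the candidate; it never asks the model.
Gate = Callable[[str], GateResult]


def compile_gate(candidate: str) -> GateResult:
    """Stand-in compilation gate: does the candidate parse as Python?"""
    try:
        compile(candidate, "<candidate>", "exec")
    except SyntaxError as exc:
        return GateResult("compile", False, str(exc))
    return GateResult("compile", True)


def audit_gate(candidate: str) -> GateResult:
    """Stand-in audit script with one hypothetical rule: no eval()."""
    if "eval(" in candidate:
        return GateResult("audit", False, "eval() is forbidden")
    return GateResult("audit", True)


GATES: List[Gate] = [compile_gate, audit_gate]


def run_gates(candidate: str, gates: List[Gate]) -> Optional[GateResult]:
    """Run gates in order; return the first failure, or None if all pass."""
    for gate in gates:
        result = gate(candidate)
        if not result.passed:
            return result
    return None


def backpressure_loop(
    propose: Callable[[Optional[GateResult]], str],
    gates: List[Gate],
    max_iterations: int = 5,
) -> Optional[str]:
    """Ask the agent for candidates until one clears every gate."""
    feedback: Optional[GateResult] = None
    for _ in range(max_iterations):
        candidate = propose(feedback)
        failure = run_gates(candidate, gates)
        if failure is None:
            return candidate
        feedback = failure  # the refusal becomes the next prompt's input
    return None


# Scripted stand-in for an agent: the first attempt violates the audit rule.
_attempts = iter(["x = eval('1')", "x = 1"])
accepted = backpressure_loop(lambda feedback: next(_attempts), GATES)
print(accepted)
```

The agent is only a source of candidates here. Swapping in a stronger or weaker one changes how many iterations the loop takes, not what the gates accept.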

The line that matters: structural gates “produce definitive answers within their scope, operating independently of model capability.” Translation: the gate does not get smarter when the model gets smarter, and it does not get dumber when the model gets dumber. It produces the same answer for the same input. That is the property that makes a verification spine.

This is the same principle the Antfly TLA+ workflow encodes at a different layer. The model checker either finds a counterexample or it does not. The output does not depend on which agent ran it.

Two Layers, One Architecture

Stack the two pieces and the picture comes into focus.

At the spec layer, Antfly has agents write TLA+ specs from a constrained assumptions.md. The model checker is the gate. The output is a counterexample trace or a clean run. The agent’s job is to compose the spec, not to be trusted with the verdict.

At the type layer, Brooks has agents emit code in target languages. The compiler is the gate. The output is a refusal or a passing build. The agent’s job is to produce code that satisfies the types, not to be trusted with safety.
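A rough Python rendering of a sealed guard type follows. It is a sketch under stated assumptions, not shengen output: `TenantId`, `authorize_tenant`, `load_invoices`, and the module-private `_SEAL` token are hypothetical names. The idea is that the only way to obtain a `TenantId` is to pass through the authorization function, so code that skips the gate has nothing of the right shape to hand over.

```python
from typing import FrozenSet

# Module-private capability: only code in this module can mint a TenantId.
_SEAL = object()


class TenantId:
    """Guard type proving that an authorization check has already run."""

    __slots__ = ("_value",)

    def __init__(self, value: str, _seal: object = None) -> None:
        if _seal is not _SEAL:
            raise TypeError("TenantId can only be created by authorize_tenant()")
        self._value = value

    @property
    def value(self) -> str:
        return self._value


def authorize_tenant(user_tenants: FrozenSet[str], requested: str) -> TenantId:
    """The authorization gate: the single sanctioned constructor."""
    if requested not in user_tenants:
        raise PermissionError(f"user may not act for tenant {requested!r}")
    return TenantId(requested, _SEAL)


def load_invoices(tenant: TenantId) -> str:
    """Business logic that accepts only an authorized tenant, never a raw string."""
    if not isinstance(tenant, TenantId):
        raise TypeError(f"cannot use {type(tenant).__name__} as TenantId")
    return f"invoices for {tenant.value}"


tenant = authorize_tenant(frozenset({"acme"}), "acme")
print(load_invoices(tenant))
```

Unlike the Go refusal quoted earlier, which happens at build time, this sketch refuses at call time. A static checker such as mypy would also flag `load_invoices("acme")` before the code runs, because of the `TenantId` annotation.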

Different layers. Same architecture. The agent is the labor; the verifier is the floor. The verifier does not need to understand intent. It needs to refuse the wrong shape.

This is what we have been pointing at for months. The governance deficit in enterprise SDD was a description of the missing floor. Agent specs as governance artifacts was a description of the missing labor input. The two pieces this week are working assemblies, in different programming-language families, at different abstraction layers, with different verification techniques. They are not academic. Antfly’s workflow reproduced a real Pebble bug. Brooks’s compiler refuses real Go code. The receipts are in.

What This Says About the Next Twelve Months

If you have been waiting for the formal-verification-meets-agents pattern to leave the conference talk and enter the codebase, the wait is over. Two practitioners shipped working implementations in one week. They will not be the last. The pattern is too clean and the cost economics are too favorable for the next batch of teams to ignore.

Three implications follow.

First, the verifier becomes the differentiator. If two agent-generated patches both compile and both pass tests but only one passes a model-checker run, the model-checker run is the thing that lets you ship. The team that has the verification spine ships faster than the team that does not, because the team without the spine still has to discover the bug in production.

Second, the spec becomes an asset. assumptions.md and boundaries.md are not throwaway prompts. They are the verification contract for a subsystem, and they live as long as the subsystem does. Teams that write these well accumulate a library of verifiable surfaces. Teams that do not, do not.

Third, the abstraction layer is open. Antfly works at the system-design layer. Brooks works at the type-system layer. Nothing prevents a future team from working at the SQL layer (compile-time schema validation against agent-issued migrations), the API contract layer (refusing agent-generated HTTP handlers that violate published schemas), or the policy layer (refusing agent actions that violate IAM contracts before they reach the runtime). The pattern travels.
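As one illustration of the policy-layer variant, here is a speculative Python sketch. No existing tool or IAM API is implied: the `Action` type, the `ALLOWED` contract, and the `execute` wrapper are assumptions. The gate checks each agent action against a declared contract and refuses it before the runtime ever sees it.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Action:
    principal: str
    verb: str
    resource: str


# Hypothetical contract: allowed (principal, verb, resource-prefix) triples.
ALLOWED: FrozenSet[Tuple[str, str, str]] = frozenset({
    ("billing-agent", "read", "invoices/"),
    ("billing-agent", "write", "invoices/drafts/"),
})


def permitted(action: Action, allowed: FrozenSet[Tuple[str, str, str]] = ALLOWED) -> bool:
    """True if some contract entry covers this principal, verb, and resource."""
    return any(
        action.principal == principal
        and action.verb == verb
        and action.resource.startswith(prefix)
        for principal, verb, prefix in allowed
    )


def execute(action: Action, runtime: Callable[[Action], str]) -> str:
    """Refuse out-of-contract actions before they reach the runtime."""
    if not permitted(action):
        raise PermissionError(f"refused: {action.verb} {action.resource}")
    return runtime(action)


calls = []


def fake_runtime(action: Action) -> str:
    """Stand-in runtime that records what it was asked to do."""
    calls.append(action)
    return "ok"


print(execute(Action("billing-agent", "read", "invoices/2026/05"), fake_runtime))
```

As with the other two layers, the gate knows nothing about what the agent intended. It only compares the action's shape with the contract.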

Do This Now

Pick one subsystem in your codebase where a wrong answer would be expensive. Write the assumptions.md for it: what it is, what it depends on, what invariants must hold. Write the boundaries.md: what the verification is allowed to touch, what it is forbidden from changing.

Now ask: which of the two stacks fits your subsystem? If the failure mode is concurrency or state-machine drift, the Antfly TLA+ workflow is your starting point. If the failure mode is a skipped authorization check or an untrusted data shape, the Brooks Shen-Backpressure pattern is your starting point. Run one iteration of the loop. See what the verifier refuses.

The receipts are in. The verification spine is no longer theoretical. Build the floor before your agents need it.


This analysis synthesizes “Cheap Code Means Formal Verification Is Reasonable Now” (Antfly, May 2026) and “Structural Backpressure Beats Smarter Agents” (Reuben Brooks, May 2026).

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
