Seven Hours to Generate, One Week to Trust

Thiago Victorino
9 min read

Reco.ai published a blog post this week with a headline designed to impress: “We Rewrote JSONata with AI in a Day, Saved $500K/Year.” The numbers are real. So is the part they buried in paragraph twelve.

The AI generation took seven hours. The verification took a full week.

Seven hours of prompting produced 13,000 lines of Go, a complete reimplementation of the JSONata 2.x specification. Cost: $400 in tokens. The code passed 1,778 test cases. It compiled. It ran. And then the real work started.

Reco’s team ran the output against 2,107 production integration tests. They deployed gnata (their open-source result) in shadow mode alongside the existing JSONata implementation, evaluating every expression against real production traffic. Billions of events, thousands of distinct expressions. They logged mismatches. They fixed edge cases. They waited for three consecutive days of zero mismatches before promoting gnata to primary.
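The shadow-mode pattern Reco describes can be sketched in a few lines of Go: serve every request from the primary implementation, run the candidate in parallel, and count disagreements without ever letting the candidate's output reach callers. This is an illustrative sketch only; the `Evaluator` signature, the `ShadowRunner` type, and the toy evaluators are assumptions for demonstration, not gnata's actual API.

```go
package main

import (
	"fmt"
	"reflect"
)

// Evaluator stands in for a JSONata expression evaluator. Both the legacy
// path and the new Go implementation would be wrapped behind the same
// signature. (Hypothetical interface; not Reco's actual API.)
type Evaluator func(expr string, input map[string]any) (any, error)

// ShadowRunner answers every request from the primary evaluator while
// running the candidate in parallel and counting disagreements. The
// candidate's output never reaches callers.
type ShadowRunner struct {
	Primary    Evaluator
	Candidate  Evaluator
	Mismatches int
}

func (s *ShadowRunner) Evaluate(expr string, input map[string]any) (any, error) {
	got, err := s.Primary(expr, input)
	shadow, shadowErr := s.Candidate(expr, input)
	if shadowErr != nil || !reflect.DeepEqual(got, shadow) {
		s.Mismatches++ // logged for later review, never served
	}
	return got, err
}

func main() {
	legacy := func(expr string, in map[string]any) (any, error) { return in[expr], nil }
	// The candidate disagrees on one key, simulating an edge-case bug.
	candidate := func(expr string, in map[string]any) (any, error) {
		if expr == "b" {
			return nil, nil
		}
		return in[expr], nil
	}
	runner := &ShadowRunner{Primary: legacy, Candidate: candidate}
	input := map[string]any{"a": 1, "b": 2}
	for _, expr := range []string{"a", "b"} {
		v, _ := runner.Evaluate(expr, input)
		fmt.Printf("%s => %v\n", expr, v)
	}
	fmt.Println("mismatches:", runner.Mismatches)
}
```

The design choice worth noticing is that mismatches are recorded but never change what the caller receives: the candidate earns trust passively, against real traffic, with zero production risk.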

The generation-to-verification ratio was roughly 1:24. For every hour the AI spent writing code, the team spent twenty-four hours confirming it could be trusted.

The $500K Headline, Unpacked

The savings claim deserves scrutiny. Reco bundles two separate projects: the JSONata rewrite (eliminating ~$300K/year in Kubernetes compute by replacing jsonata-js pods with a native Go implementation) and a separate rule engine refactor (saving an additional ~$200K). The “1000x speedup” on common expressions is partly from eliminating RPC overhead between services, not purely from better code.

None of this makes the project less impressive. It makes it more instructive. The honest version of the story is better than the headline version. A team with deep domain knowledge used AI to generate a first draft of a complex language reimplementation, then spent a week building the evidence that the draft was production-worthy.

That ratio, fast generation and slow verification, is the pattern every engineering organization should be studying.

The Constraint Hierarchy

Will Larson published “Judgment and creativity are all you need” on March 11. Larson is the CTO of Imprint (formerly at Stripe) and the author of Staff Engineer, a book that shaped how the industry thinks about senior technical roles. His argument is structural, not aspirational.

Larson describes a constraint hierarchy for software engineering in 2026: Time, then Attention, then Judgment, then Creativity. Each constraint only becomes binding after the one above it is solved.

Coding agents solved the Time constraint. Writing code is fast now. But speed created a new problem. Larson’s team at Imprint spends almost all their time on two activities: designing approaches and reviewing coding agent pull requests. When a design doesn’t survive contact with reality, they revise and iterate. The actual typing of code is a rounding error.

This matches the Reco.ai data precisely. The seven hours of generation solved the time constraint. The week of verification was the judgment constraint asserting itself.

As we explored in The Speed Trap, faster code production does not mean faster delivery. Reco’s numbers put a price on the claim. The generation cost $400. The verification cost a week of senior engineering time. In any reasonable accounting, the verification was the more expensive phase by an order of magnitude.

Larson goes further. He argues that “datapacks” (expert context packages for coding agents) represent the mechanism for scaling judgment itself. The industry is building skill repositories for agents: shared configurations, domain-specific rules, architectural patterns encoded as agent instructions. This is judgment-as-infrastructure. It is also, whether Larson uses the word or not, governance.

Coding Guidelines Are Governance Documents

The same week, Stack Overflow published “Building shared coding guidelines for AI (and people too).” The piece gathers enterprise leaders who converge on a single operational insight: coding guidelines for AI agents are governance infrastructure.

Charity Majors, CTO of Honeycomb, has been making this argument for years in the context of observability. Quinn Slack, CEO of Sourcegraph, has built tooling around code intelligence. Vish Abrams, Chief Architect at Heroku, works at the infrastructure layer where bad code becomes expensive fast. Logan Kilpatrick speaks from Google DeepMind’s perspective on agent capabilities.

Their consensus: when agents write most of the code, the standards those agents follow become the primary quality control mechanism. A style guide is no longer a suggestion for junior developers. It is a constraint layer for autonomous systems.

This confirms at enterprise scale what we argued about style guides as governance layers. The difference is scope. When we wrote that piece, the pattern was emerging in content operations teams encoding brand rules into CLAUDE.md files. Now CTOs of billion-dollar companies are describing the same pattern for production codebases. The mechanism is identical. The stakes are higher.

The 1:24 Ratio and What It Means

The Reco.ai story is valuable because it gives us a concrete number for something the industry has been discussing in abstractions. One hour of AI generation requires twenty-four hours of human verification. Call it the 1:24 ratio.

Is this ratio universal? No. Simple CRUD applications probably have a lower verification burden. Safety-critical systems probably have a higher one. But the order of magnitude is instructive. Even in a best-case scenario (experienced team, clear specification, extensive test suite, open-source transparency), the verification phase dominated the project timeline by a factor of twenty-four.

Industry data supports the pattern at scale. Sonar’s 2026 survey found 96% of developers don’t fully trust AI-generated code. Only 48% verify outputs before committing. Faros AI, studying over 10,000 developers, found PR review time increased 91% in teams with high AI adoption.

The developers are rational. They know the code needs checking. Many of them skip the checking anyway because the verification infrastructure doesn’t exist at their organizations. The code arrives faster than anyone can evaluate it. As we explored in our analysis of AI judgment metrics, nobody is scoring agent decisions systematically. Larson’s constraint hierarchy explains why: most organizations haven’t recognized that judgment, not generation, is where they should be investing.

What the Convergence Reveals

Three independent sources, published within the same two-week window, none citing each other, arriving at compatible conclusions. Reco.ai provides the data. Larson provides the theory. Stack Overflow provides the enterprise consensus.

The synthesis:

Generation is solved. The cost of producing code dropped to near-zero. This is no longer news. It is an accomplished fact.

Verification is the bottleneck. The cost of confirming that generated code is correct, safe, performant, and aligned with organizational standards has not dropped. In many organizations, it has increased because the volume of generated code overwhelms existing review capacity.

Judgment is the last constraint. Once you can generate code instantly and verify it systematically, the remaining constraint is deciding what to build, how to build it, and whether the result serves the actual need. Larson calls this judgment. Jeff Gothelf calls it product sense. The Stack Overflow experts call it coding guidelines. They are all describing the same capability: the ability to evaluate whether generated output is good.

Governance is the infrastructure for judgment at scale. An individual developer with deep expertise can evaluate AI output. An organization with hundreds of developers cannot rely on individual expertise. It needs systems: coding standards encoded as agent constraints, verification pipelines that match generation throughput, review processes that focus human attention on judgment calls rather than syntax.

The Investment Implication

Most organizations are investing in generation. Better models, faster agents, more tokens, broader coverage. This is optimizing the seven-hour part of a week-long process.

The organizations that will win are investing in verification. Test infrastructure that scales with generation volume. Shadow deployment systems like Reco’s that validate AI output against production reality. Coding guidelines that function as governance documents. Review processes redesigned around the 1:24 ratio, where human time is allocated to judgment, not line-by-line inspection.
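A promotion rule like Reco's "three consecutive days of zero mismatches" reduces to a small state machine, which is exactly why it scales: the gate is mechanical, so human judgment is spent on diagnosing mismatches rather than deciding when to ship. The sketch below is hedged: `PromotionGate` and its fields are invented names, and a real system would persist the streak and feed it from logged mismatch counts.

```go
package main

import "fmt"

// PromotionGate tracks daily shadow-mode mismatch totals and signals when
// a candidate implementation has earned promotion: a configurable streak
// of consecutive zero-mismatch days. (Hypothetical sketch; only the
// "three consecutive days of zero mismatches" criterion is from Reco.)
type PromotionGate struct {
	RequiredStreak int
	streak         int
}

// RecordDay ingests one day's mismatch total and reports whether the
// candidate may be promoted to primary.
func (g *PromotionGate) RecordDay(mismatches int) bool {
	if mismatches == 0 {
		g.streak++
	} else {
		g.streak = 0 // any mismatch resets the clock
	}
	return g.streak >= g.RequiredStreak
}

func main() {
	gate := &PromotionGate{RequiredStreak: 3}
	days := []int{4, 0, 0, 1, 0, 0, 0} // mismatches logged per day
	for i, m := range days {
		if gate.RecordDay(m) {
			fmt.Printf("promote after day %d\n", i+1)
			return
		}
	}
	fmt.Println("keep shadowing")
}
```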

Reco.ai got the hard part right. They didn’t just generate fast. They built the trust infrastructure to prove the generation was trustworthy. The $400 in tokens was the easy expense. The week of systematic verification was the investment that made the $500K savings real.

The question for every engineering leader is simple: you have the generation capacity. Do you have the verification infrastructure to match it?


This analysis synthesizes We Rewrote JSONata with AI in a Day, Saved $500K/Year (March 2026), Judgment and creativity are all you need (March 2026), and Building shared coding guidelines for AI (and people too) (March 2026), with supporting data from Sonar’s State of Code 2026 survey and Faros AI’s engineering productivity benchmarks.

Victorino Group helps enterprises build verification infrastructure that matches AI generation speed. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
