Operating AI

The 65% Rule: Why Production AI Is Becoming Mostly Code

Thiago Victorino

Here is the paradox nobody is talking about: as AI becomes more capable, production AI systems are becoming more code, not less.

At the individual level, developers write less code than ever. Karpathy reports his own coding has flipped to 80% AI-generated. Cursor reports that 35% of internal PRs come from autonomous agents. The narrative is clear: AI is eating software development.

But zoom out from the individual to the system, and the opposite is happening. The teams running AI in production are systematically replacing LLM calls with deterministic code. Not because AI failed. Because this is what actually works when you have to ship.

One Team’s Production Telemetry

Tom Tunguz at Theory Ventures published something useful in February 2026. His team analyzed 14 production agent workflows after running them for six months. The numbers were specific: 65% of nodes across those workflows execute as deterministic code. Not LLM calls. Not agentic reasoning. Plain code.

The distribution is more revealing than the average. Of those 14 workflows, 29% use zero LLM calls. They started as agentic experiments and ended as traditional software. Another 36% are between 67% and 91% deterministic. Only 14% remain fully agentic.

Let me be precise about the limitations. This is N=1 data. One team’s workflows, one firm’s production environment. It is not an industry study, and anyone citing “65%” as an established benchmark is overstating the evidence.

But the pattern is directionally interesting for a specific reason: it matches what other teams report independently, through completely different lenses.

Stripe’s Architecture Is the Strongest Evidence

Stripe published a detailed engineering blog post about their agent architecture, Minion. It merges over 1,000 pull requests per week with zero human-written code. The architecture uses what they call Blueprints: directed acyclic graphs where deterministic nodes (git operations, linting, CI validation) alternate with agentic nodes (code generation, reasoning about test failures).
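The Blueprint pattern can be sketched in a few lines. This is not Stripe's code; it is a minimal illustration of the idea, with hypothetical node names, where a workflow is a DAG whose nodes are tagged as deterministic ("code") or agentic ("llm") and executed once their dependencies finish:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str        # "code" (deterministic) or "llm" (agentic)
    run: object      # callable taking the dict of dependency results
    deps: list = field(default_factory=list)

def execute(nodes):
    """Run each node once all of its dependencies have finished."""
    done, pending = {}, dict(nodes)
    while pending:
        ready = [n for n in pending.values() if all(d in done for d in n.deps)]
        if not ready:
            raise ValueError("dependency cycle")
        for n in ready:
            done[n.name] = n.run({d: done[d] for d in n.deps})
            del pending[n.name]
    return done

# Hypothetical Blueprint: deterministic and agentic nodes alternating.
workflow = {
    "checkout":       Node("checkout", "code", lambda deps: "repo"),
    "generate_patch": Node("generate_patch", "llm",
                           lambda deps: f"patch for {deps['checkout']}",
                           deps=["checkout"]),
    "lint":           Node("lint", "code", lambda deps: "clean",
                           deps=["generate_patch"]),
}
results = execute(workflow)
```

The point of tagging each node is that governance can treat the two kinds differently: deterministic nodes need ordinary testing, agentic nodes need validation and escalation around them.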

Two design decisions stand out.

First, the two-round CI cap. When a Minion-generated PR fails continuous integration, the agent gets exactly two attempts to fix it. After two rounds, the task escalates. This is not a technical limitation. Stripe could easily allow five rounds or ten. It is a deliberate governance constraint that prevents agents from chasing their own tails in production infrastructure.
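The cap is trivial to express in code, which is part of its appeal as a governance mechanism. A minimal sketch, where `run_ci`, `agent_fix`, and `escalate` are hypothetical stand-ins rather than Stripe APIs:

```python
MAX_CI_ROUNDS = 2  # the governance constraint: two repair attempts, then a human

def repair_until_green(pr, run_ci, agent_fix, escalate):
    for _ in range(MAX_CI_ROUNDS):
        result = run_ci(pr)
        if result.passed:
            return pr
        pr = agent_fix(pr, result.failures)  # agentic node: interpret and patch
    # Final check after the last allowed fix; if still red, hand off.
    if run_ci(pr).passed:
        return pr
    return escalate(pr)
```

The constant is the policy. Raising it is a one-line change, which is exactly why keeping it low is a deliberate decision rather than an accident of implementation.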

Second, the tool architecture. Stripe built Toolshed, a platform providing over 400 MCP tools. But those tools are overwhelmingly deterministic operations: read this file, run this linter, execute this test suite. The LLM decides which tools to invoke and interprets the results. The tools themselves are code.
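The division of labor looks something like this sketch (hypothetical tool names, not Toolshed's API): the tools are plain functions, and the model's only contribution is choosing which one to invoke, with unknown names rejected deterministically.

```python
def read_file(path: str) -> str:
    """Deterministic tool: just reads the file."""
    with open(path) as f:
        return f.read()

def run_linter(path: str) -> list[str]:
    """Deterministic stub: a real version shells out to a linter."""
    return []

TOOLS = {"read_file": read_file, "run_linter": run_linter}

def dispatch(tool_name: str, **kwargs):
    """The LLM picks tool_name; everything past this point is code."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")  # reject hallucinated names
    return TOOLS[tool_name](**kwargs)
```

Because the registry is the boundary, a hallucinated tool name fails loudly at dispatch instead of silently downstream.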

This is the hybrid architecture Tunguz’s data points toward, implemented at massive scale by a company processing hundreds of billions of dollars in payments. The LLM handles what LLMs are good at (reasoning about ambiguous situations, interpreting error messages, deciding what to try next). Everything else is code.

As we explored in The Most Governed Software Factory, StrongDM took this principle to its logical extreme: zero human-written code, compensated by layers of deterministic validation. Stripe’s architecture is a less extreme version of the same insight. Reduce the surface area of non-deterministic behavior. Compensate with governance mechanisms that run at machine speed.

Maturation or Containment?

There is an honest counterargument to this entire thesis. Maybe the high ratio of deterministic code does not represent architectural maturation. Maybe it represents coping. Teams discover that LLMs are unreliable in production, so they wall them off with deterministic guards. The 65% is not wisdom. It is containment.

Gartner predicts that 40% or more of agentic AI projects will fail or be abandoned by 2027. If the industry is building containment architecture rather than mature architecture, those failure rates make sense. You can only contain so much before the maintenance cost exceeds the benefit.

This is worth taking seriously. But here is why the distinction may not matter for operational purposes: whether you call it maturation or containment, the governance implication is identical. You need deterministic validation. You need escalation boundaries. You need monitoring at the workflow level, not just the model level. You need to know which nodes are agentic and which are code, and you need different oversight mechanisms for each.

The architecture is the same whether you are optimistic or pessimistic about where LLMs end up. Build the deterministic scaffold. Let the ratio shift over time as models improve. The scaffold is never wasted because it is also your governance layer.

The Prompt Portability Problem

Jason Lemkin at SaaStr surfaced a different angle on why deterministic code matters more than the AI layer. He described an AI company with over $100M in annual recurring revenue that closes only one-year deals. The reason: customers discovered they could migrate prompt-based agents to competing models in minutes. Fifty to eighty percent of the migration happened automatically.

If your product is primarily prompts talking to an LLM, your moat evaporates the moment a cheaper or better model appears. The customer takes your prompt patterns, points them at the new model, and walks away.

This is anecdotal evidence from one SaaS operator, and the portability claim applies specifically to prompt-heavy agents in a pre-fine-tuning, pre-deep-RAG era. Organizations that have invested in fine-tuned models or proprietary training data have stickier products. But the directional point holds: the AI layer is commoditizing faster than most companies priced into their valuations.

The durable value lives in the deterministic layer. The workflow logic. The integration code. The validation rules. The governance artifacts. These do not port to a new model in minutes. They represent genuine institutional knowledge encoded as software.

What Workpath Found When They Looked

If the production data tells us what architecture works, the user research tells us why the naive alternative fails.

Workpath’s Noa Aboy published findings from over 1,000 manual conversation reviews of their agent product. Users called only 5 of 50 available tools. Half of all requests were for bulk operations the agent could not perform. Tools failed silently, returning success indicators while producing no useful result. Aboy’s summary was precise: “everything failing successfully.”

This is what happens when you build fully agentic systems without the deterministic scaffold. The agent has tools. The tools technically execute. But the system lacks the validation layer to distinguish between “the tool ran” and “the tool accomplished what the user needed.” Without deterministic checkpoints, failures are invisible until a human manually reviews the output.
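The missing layer is a deterministic postcondition check after every tool call. A minimal sketch, with a hypothetical `bulk_update` tool that "fails successfully":

```python
def checked_call(tool, postcondition, *args, **kwargs):
    """Run a tool, then verify a deterministic postcondition on its result."""
    result = tool(*args, **kwargs)
    if not postcondition(result):
        raise RuntimeError(
            f"{tool.__name__} reported success but failed its postcondition")
    return result

# Hypothetical tool that "fails successfully": status ok, nothing updated.
def bulk_update(ids):
    return {"status": "ok", "updated": 0}

try:
    checked_call(bulk_update, lambda r: r["updated"] > 0, ids=[1, 2, 3])
except RuntimeError as err:
    print(err)  # the checkpoint turns a silent failure into a visible one
```

The postcondition encodes what the user actually needed, not what the tool's return code claims, which is precisely the gap Aboy's reviews kept finding.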

Stripe solved this with two-round CI caps and 400 deterministic tools. StrongDM solved it with scenarios and digital twins. Both architectures share the same insight: you need machine-speed validation, not just machine-speed execution.

The Governance Architecture

Pull the threads together and a framework emerges.

At the individual level, AI handles most of the code generation. Developers write less, review more, direct more. This is Karpathy’s 80/20 flip, and it is real.

At the system level, the ratio inverts. Production architectures are converging toward mostly-deterministic workflows with AI handling the genuinely ambiguous decisions. The deterministic code is not overhead. It is the governance layer.

At the organizational level, as we examined in The Operations Deficit, the constraint is not model capability. It is operational maturity. The teams that succeed are the ones building feedback loops, monitoring infrastructure, and graduated autonomy frameworks.

These three levels connect. The individual writes less code. The system uses less AI. The organization needs more operational discipline than ever. None of these statements contradict each other. They describe the same reality from different altitudes.

What This Means

If governance artifacts are the durable asset (not the prompts, not the model choice, not the AI layer), then three things follow.

First, track your deterministic ratio. For every agentic workflow in production, measure what percentage of nodes are deterministic code versus LLM calls. If you do not know this number, you do not understand your own system’s behavior. Stripe knows. Tunguz’s team knows. You should too.
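The metric itself is almost trivially computable once nodes are labeled. A minimal sketch, assuming each workflow is a list of nodes tagged "code" or "llm" (the workflow and node names here are hypothetical):

```python
def deterministic_ratio(nodes):
    """Share of workflow nodes that execute as plain code, not LLM calls."""
    return sum(kind == "code" for _, kind in nodes) / len(nodes)

triage_workflow = [
    ("fetch_ticket", "code"),
    ("classify_intent", "llm"),
    ("route_ticket", "code"),
    ("post_summary", "code"),
]
print(f"{deterministic_ratio(triage_workflow):.0%}")  # 75%
```

The hard part is not the arithmetic; it is instrumenting your workflows so that every node carries the label in the first place.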

Second, invest in the scaffold before scaling the AI. Validation rules, CI constraints, escalation boundaries, monitoring. These are not bureaucratic overhead. They are the architecture. As the StrongDM analysis showed, the most aggressively AI-native company in the industry built the most rigorous governance framework. Not despite the automation. Because of it.

Third, treat prompt portability as a feature, not a bug. If your governance architecture is strong enough, swapping the underlying model is a configuration change. Your competitive advantage lives in the workflow logic, the validation layer, and the operational discipline. Not in which model you call.
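Concretely, "model swap as a configuration change" means the workflow calls a thin adapter keyed by config, never a vendor SDK directly. A sketch under that assumption (the adapter and its `complete` method are hypothetical stand-ins, not a real SDK):

```python
class Adapter:
    """Thin per-provider adapter; complete() is a hypothetical stand-in."""
    def __init__(self, name):
        self.name = name
    def complete(self, model, prompt, temperature):
        # A real adapter calls the vendor SDK here; this stub just echoes.
        return f"[{self.name}/{model}] {prompt}"

CLIENTS = {"vendor_a": Adapter("vendor_a"), "vendor_b": Adapter("vendor_b")}

MODEL_CONFIG = {"provider": "vendor_a", "model": "model-x", "temperature": 0.0}

def call_model(prompt, config=MODEL_CONFIG):
    """Workflow logic talks to config, not to a vendor."""
    return CLIENTS[config["provider"]].complete(
        config["model"], prompt, config["temperature"])

# Swapping the underlying model is a configuration change, not a rewrite:
MODEL_CONFIG.update(provider="vendor_b", model="model-y")
```

Everything above `call_model` is the commodity layer; everything that calls it, the workflow logic and validation, is where the durable value lives.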

The 65% is a heuristic, not a law. The specific number will vary by domain, by team, by use case. But the direction is consistent across every credible production report: more deterministic code, not less. More governance architecture, not less. More operational discipline, not less.

Production AI is becoming mostly code. The organizations that understand this will build systems that actually work.


This analysis synthesizes Tom Tunguz’s production agent analysis (February 2026), Stripe’s engineering blog on Minion architecture (February 2026), SaaStr on AI company deal structures (February 2026), and Workpath’s agent conversation research (February 2026).

Victorino Group helps organizations design the governance architecture that makes production AI systems reliable and auditable. Let’s talk.
