Software's Centaur Era Just Started. Measure the Team, Not the Model.

In May 1997, Deep Blue beat Garry Kasparov. The headline was that machines had passed humans at chess. The longer story, the one that ran for the next two decades, was different. For roughly twenty years after the match, the strongest entity at the board was neither a human nor an engine. It was a human paired with an engine: a centaur. The pair beat the engine alone and crushed the human alone. That era ended only recently, when engines finally outgrew even guided play.

Richard Marmorstein’s essay Software’s Centaur Era argues, persuasively, that software has just entered the same window. Coding agents today cannot sustain long-horizon work without a human at the wheel. Left alone, they drift, hallucinate context, and produce code that compiles but does not belong in the system it was meant to serve. Steered, they move faster than either party could alone. We are in the centaur years, and the centaur years tend to last longer than anyone expects.

If that framing is correct, and we think it is, the implication for governance is the part nobody is talking about loudly enough. The measurement question stops being about the model. It becomes about the pair.

What “centaur” means as a unit of work

The chess analogy was never about chess. It was about a class of problems where the engine has tactical depth the human lacks and the human has long-horizon judgment the engine lacks. Software, today, fits that shape almost exactly. An agent can churn through a thousand candidate refactors, hold the syntax tree in working memory, and write the bash invocation faster than you can spell it. It cannot, reliably, decide which of those refactors matters next quarter. It does not know which abstraction your team will regret in six months. It cannot tell you when to stop.

The human in the centaur supplies exactly those things: the stopping rule, the architectural taste, the institutional memory, the relationship with the person who will own the code at 3am. The agent supplies throughput and recall. Either one alone is a worse engineer than the pair.

This sounds like a feel-good framing until you try to measure it. The moment you ask “how productive is the agent,” you are asking the wrong question, because the agent is not the unit of production. The pair is. A measurement architecture that tracks agent output without tracking human steering is measuring half a centaur and calling it a horse.

The bar for energy-saving tools is higher than for time-saving tools

Marmorstein puts a finer point on this with a constraint that deserves to be quoted everywhere governance teams gather: the bar for a tool that saves you energy is higher than the bar for a tool that saves you time.

A time-saver only has to be net-faster than the alternative. You tolerate friction because the wall clock won. An energy-saver has to leave the human doing less cognitive work, not more, after the tool enters the loop. Most coding agents today save time and burn energy. The developer babysits the output, re-reads the diff, runs the tests, holds the architectural picture in their head because the agent does not, and finishes the day more tired than they would have been without the agent. The hours look good on the report. The human looks ground down by Friday.

This is why “productivity gain” measured in time-to-merge is misleading. If your agent shaves 30% off cycle time but the developer is now doing the mental work of two people, the centaur is broken. The pair is not faster in any sense that compounds. It is faster in a sense that erodes. By the end of the quarter, your team’s best engineers are the ones quietly turning off the agents, because for them the centaur math went negative two months ago and nobody was measuring the right axis.

The governance implication: any agent rollout that does not instrument human cognitive load alongside agent throughput is flying blind on the variable that determines whether the pair is sustainable.
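The arithmetic behind the 30% example can be made explicit. The sketch below is a toy model, not something taken from Marmorstein’s essay: it assumes cognitive energy per task is simply task duration multiplied by cognitive load per hour, both expressed relative to a no-agent baseline. The function names and the multiplication itself are illustrative assumptions.

```python
def relative_energy_per_task(time_factor: float, load_factor: float) -> float:
    """Cognitive energy per task relative to the no-agent baseline.

    time_factor: task duration relative to baseline (0.7 means 30% faster).
    load_factor: cognitive load per hour relative to baseline (2.0 stands in
                 for "the mental work of two people").
    ASSUMPTION: energy = duration x load. Real fatigue is not this linear.
    """
    return time_factor * load_factor


def classify_tool(time_factor: float, load_factor: float) -> str:
    """Label a tool by which axis it actually improves.

    Break-even (a factor of exactly 1.0) is treated as not saving.
    """
    energy = relative_energy_per_task(time_factor, load_factor)
    if time_factor < 1.0 and energy < 1.0:
        return "saves time and energy"
    if time_factor < 1.0:
        return "saves time, burns energy"
    if energy < 1.0:
        return "saves energy, costs time"
    return "saves neither"


# The example from the text: 30% off cycle time, twice the mental load.
print(relative_energy_per_task(0.7, 2.0))
print(classify_tool(0.7, 2.0))
```

Under these assumptions, the 30% faster task costs roughly 1.4 times the baseline energy. The time-to-merge report improves while the axis that determines sustainability gets worse.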

Why “control the AI” is the wrong frame

Most current governance literature treats the agent as the thing to constrain. Guardrails, sandboxes, identity floors, permission models. Necessary, all of them. Sufficient, none of them. They answer the question “what can the agent not do.” They do not answer the question “is the pair working.”

You can have a perfectly contained agent operating inside a perfectly safe environment and still have a broken team. The agent does not break the production database. The human burns out by month three because the pair was never sized correctly: too many agent threads per human, no clear stopping rule, no architecture for handing context back to the operator, no measurement of when the operator is overloaded.

The control conversation is mature. The measurement conversation has barely started. We have written about adoption gaps, where the question is whether organizations are using AI at all (see AI Eats the World 2026). We have written about On the Loop, Not In the Loop, where the question is what role the human should occupy in agent operations. Those framings stand. The centaur framing builds on top of them: once you have decided humans are on the loop, you still have to decide whether the loop, as a pair, is producing more than the sum of its parts. That requires measuring the pair, not the parts.

What measuring the team looks like

Concretely, a centaur-aware measurement architecture has three layers.

The first layer is agent throughput, which most teams already track: tasks completed, PRs raised, tests authored, lines of code generated. This is the visible half of the pair. It is necessary and insufficient.

The second layer is human cognitive load. This is the layer that almost no production deployment instruments today. Useful signals: time spent reviewing agent output versus time spent writing code directly, frequency of context switches per hour, ratio of agent-initiated changes to human-initiated changes, self-reported energy at end of week. The goal is not to surveil. The goal is to know when the centaur is asking too much of its human half, so you can fix it before the human quietly opts out.

The third layer is pair output, which is what the business actually cares about. Did the work product improve? Did defects go down? Did time-to-value shrink at constant or lower human energy cost? This is where the time-saver-versus-energy-saver distinction lives. A pair that ships faster but exhausts its human is a pair that will dissolve. A pair that ships faster while preserving energy is a pair that compounds.

A team measured only on layer one will optimize for agent activity. A team measured only on layer three will not know which lever to pull when things go wrong. The three layers together let you ask the right diagnostic question: which half of the centaur is the bottleneck this week, and what do we change to rebalance?
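As a rough illustration of how the three layers could feed that diagnostic question, here is a minimal sketch. Everything in it is hypothetical: the field names, the choice of signals per layer, and especially the thresholds (a 60% review share and an energy score of 3 out of 5) are placeholders to be calibrated against your own team, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class WeeklyPairMetrics:
    """One week of measurements for one human-plus-agent pair (hypothetical fields)."""
    # Layer 1: agent throughput
    agent_prs_raised: int
    # Layer 2: human cognitive load
    review_hours: float            # time spent reviewing agent output
    authoring_hours: float         # time spent writing code directly
    self_reported_energy: int      # end-of-week survey, 1 (drained) to 5 (fresh)
    # Layer 3: pair output
    changes_shipped: int
    defects_escaped: int


def diagnose(
    current: WeeklyPairMetrics,
    previous: WeeklyPairMetrics,
    max_review_share: float = 0.6,   # ASSUMPTION: illustrative threshold
    min_energy: int = 3,             # ASSUMPTION: illustrative threshold
) -> str:
    """Return a coarse label for which half of the pair needs attention."""
    total_hours = current.review_hours + current.authoring_hours
    review_share = current.review_hours / total_hours if total_hours else 0.0

    # Layer 2 is checked first: a pair that ships faster but exhausts
    # its human is treated as unsustainable regardless of output.
    if review_share > max_review_share or current.self_reported_energy < min_energy:
        return "human_overloaded"

    # Layer 3: did the pair's output hold or improve week over week?
    output_improved = (
        current.changes_shipped >= previous.changes_shipped
        and current.defects_escaped <= previous.defects_escaped
    )
    if output_improved:
        return "balanced"

    # Layer 1 versus layer 3: more agent activity without better output.
    if current.agent_prs_raised > previous.agent_prs_raised:
        return "agent_activity_not_converting"
    return "output_regressed"


last_week = WeeklyPairMetrics(10, 10.0, 20.0, 4, 8, 2)
this_week = WeeklyPairMetrics(18, 25.0, 10.0, 2, 11, 2)
print(diagnose(this_week, last_week))
```

In this example the agent raised more PRs and the pair shipped more changes, yet the diagnosis flags the human half, because review now consumes most of the week and reported energy has dropped. A dashboard built on layers one and three alone would have reported a good week.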

What the centaur era is not

Two things this framing does not promise.

It does not promise that the era lasts forever. Chess engines eventually outgrew guided play. Coding agents probably will too, on some workloads, on some horizon. The honest position is that nobody knows how long the window stays open. Twenty years would not be surprising. Five would not be surprising either. The right posture is to build for the centaur years while watching for the signal that they are ending.

It does not promise the centaur is always the right answer. There are tasks where pure human work is faster, and there are tasks where pure agent work is good enough. The centaur is the right unit for the long-horizon, judgment-heavy, taste-laden work that defines most production software engineering. It is not the right unit for one-off scripts or for high-volume low-stakes generation where review overhead exceeds the work itself.

The centaur framing is a default, not a universal. The work is figuring out where it applies and instrumenting it when it does.

Do this now

Pick one team that is running coding agents in production. Spend 45 minutes with them. Ask three questions: how do we measure agent throughput today, do we measure human cognitive load at all, and what does the pair produce that neither half would produce alone? If you cannot answer the second question with anything more specific than “they say it feels okay,” you are operating a centaur without a dashboard for half of it. Build the missing half this quarter, before your best engineers quietly decide the math no longer works.

The centaur years are good years. They reward teams that take the pair seriously as the unit of work. They punish teams that keep measuring the model and ignoring the rider.

This analysis synthesizes Software’s Centaur Era (Richard Marmorstein, May 2026).

Victorino Group helps teams design measurement architectures for human-plus-AI work where both halves of the centaur count.