Terminal-Bench 2.0: The Harness Is the Procurement Decision Your CIO Isn't Making
Procurement teams evaluating AI coding tools ask the wrong question first. They ask which model is inside. The model is the variable everyone knows how to compare: parameter counts, context windows, leaderboard ranks, vendor fact sheets. The harness around the model is treated as plumbing, an implementation detail handled by whoever is deploying the tool.
Nicholas Bustamante’s Model-Harness-Fit thread, surfaced by TLDR AI this week, makes that ordering indefensible. The harness is not plumbing. The harness is the product. And as of Terminal-Bench 2.0, there is finally a benchmark that proves it with numbers a CFO can read.
The 4.5-Point Swing
Per Bustamante’s analysis, Terminal-Bench 2.0 measured Claude Opus 4.6 inside two different harnesses. ForgeCode wrapped it: 79.8 percent. Capy wrapped it: 75.3 percent. Same weights. Same prompts. 4.5 points of difference, attributable entirely to the scaffolding around the model.
Cursor, per the same thread, jumped from “Top 30 to Top 5” on a harness change alone.
If you have spent any time on procurement spreadsheets, you know what 4.5 points means. It is the difference between “we can deploy this for production code review” and “we keep it in advisory mode.” It is the line between a tool a CTO can defend in front of a security committee and one that gets sent back for another evaluation cycle. Frontier model selection rarely produces 4.5-point swings within the same generation. The harness, apparently, can do it on a Tuesday.
Why the Harness Moves the Number
The mechanism is post-training. Per Bustamante’s read of how Codex CLI, Claude Code, and GitHub Copilot CLI are built, frontier labs do not ship raw weights and let the harness adapt. They post-train the model against a specific harness. The tool names the model is taught to call, the schema of those tool calls, the citation tags it learns to emit, the memory rituals it expects to perform, the system prompt structure it has been optimized against. All of these are baked into the weights.
Swap the harness and you have removed the substrate the weights were trained against. The model still works. It works less well. That 4.5-point gap is the cost of the mismatch.
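To make the mismatch concrete, here is a minimal, purely illustrative sketch. The harness names, tool names, and schemas below are invented for the example; they are not taken from ForgeCode, Capy, or any real product. The point is the shape of the problem: a model post-trained to emit one tool-call format still "works" when wrapped by a harness expecting another, but every call has to survive a lossy translation.

```python
# Hypothetical tool-call schemas for the same operation under two invented harnesses.
# Neither schema is taken from a real product; this only illustrates the shape of a mismatch.

harness_a_call = {            # the format the model was post-trained to emit
    "tool": "read_file",
    "arguments": {"path": "src/app.py", "max_bytes": 4096},
}

harness_b_expected = {        # the format a different harness expects to receive
    "name": "fs.read",
    "input": {"file_path": "src/app.py"},
    "citation_tag": "<file:src/app.py>",   # a field the model was never trained to produce
}

def adapt(call: dict) -> dict:
    """Naive adapter a deployer might write: maps tool names and argument keys.

    Anything the model never learned to emit (the citation tag) must be synthesized
    by the wrapper, and anything it emits that the new harness ignores (max_bytes)
    is silently dropped. That lossy translation is where the benchmark points go.
    """
    return {
        "name": {"read_file": "fs.read"}.get(call["tool"], call["tool"]),
        "input": {"file_path": call["arguments"]["path"]},
        "citation_tag": f"<file:{call['arguments']['path']}>",
    }

print(adapt(harness_a_call) == harness_b_expected)  # True for this one call; not in general
```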
The implication for a buyer is not subtle. When a vendor says “we use Claude Opus 4.6 inside,” they are telling you the model name. They are not telling you whether the harness around it was the harness that model was post-trained against, a competing harness, or a generic wrapper a junior engineer assembled in a weekend. Those three cases produce three different products with the same model card.
The Procurement Reframe
Treat the harness as a first-class evaluation criterion. That sentence sounds unobjectionable. In practice, almost no procurement process does it. The standard rubric covers model name, context window, pricing, SOC 2, data residency, and integration surface. The harness shows up as “implementation detail” or, in the worst cases, as “vendor’s secret sauce we cannot disclose.”
A harness-aware procurement asks different questions:
Which harness is the model post-trained against? If the vendor cannot answer, that is the answer. Model-harness fit was not a design constraint.
On which benchmark, and inside which harness, was the published number measured? A 79.8 means nothing without the harness it ran inside. Insist on the pair; a minimal record for keeping the two together is sketched after this list.
If we replace the harness in our deployment, does the vendor’s quoted accuracy still apply? Most enterprise deployments wrap the vendor’s tool inside their own MCP server, their own context-window manager, their own memory layer. Each wrapper is, formally, a partial harness change. The vendor’s number was measured against their harness, not yours.
Who controls the system prompt, the tool schema, and the citation format inside the harness? If those are vendor-controlled and undocumented, you are operating with hidden coefficients in your evaluation.
What is the harness change cadence? A vendor that ships harness updates monthly is shipping a different product monthly. The model name on the contract does not change. The behavior does.
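As promised above, here is one way to make "insist on the pair" operational: a small record in which a score cannot be stored without the harness and benchmark it was measured under. The field names and the class itself are ours, invented for illustration, not any standard or vendor API; the numbers in the example are the ones from the thread.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkClaim:
    """One vendor-quoted accuracy number, stored as an inseparable model+harness pair.

    Field names are illustrative, not a standard; the point is that `score`
    is meaningless unless `harness` and `benchmark` travel with it.
    """
    model: str            # e.g. "Claude Opus 4.6"
    harness: str          # the scaffolding the number was measured inside, e.g. "ForgeCode"
    benchmark: str        # e.g. "Terminal-Bench 2.0"
    score: float          # the published number
    deployment_harness: str | None = None   # what the model will actually run inside at your site

    @property
    def applies_to_deployment(self) -> bool:
        # The quoted score only transfers if the deployment harness is the measured one.
        return self.deployment_harness is not None and self.deployment_harness == self.harness

quoted = BenchmarkClaim("Claude Opus 4.6", "ForgeCode", "Terminal-Bench 2.0", 79.8,
                        deployment_harness="internal MCP + memory layer")
print(quoted.applies_to_deployment)  # False: the vendor's 79.8 was not measured on your stack
```

A claim that arrives without the harness field cannot even be entered in a structure like this, which is exactly the discipline the questions above are pushing toward.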
The Cursor Case
Cursor’s “Top 30 to Top 5” jump is the most legible piece of the thread for a non-technical buyer. A single product, with the same underlying model relationships, moved roughly 25 ranks on a benchmark by changing the harness. If you bought Cursor before that change, you bought a Top 30 product. If you bought it after, you bought a Top 5 product. The vendor name was the same. The model name was the same. The procurement team that read only the model spec missed the entire delta.
The lesson does not require taking sides on which harness is better. The lesson is that “Cursor” as a procurement object is not a stable artifact. It is a model plus a harness, and the harness moves. Any AI coding tool the vendor names is the same kind of object, whether they say so on the data sheet or not.
What This Means for the CIO
Three concrete changes for buyers this quarter.
Stop accepting model names as product names. When a vendor says “powered by GPT-5” or “uses Claude Opus 4.6,” your follow-up is “in which harness, post-trained how, with what tool schema.” If the vendor cannot answer at the level of the harness, they have not made model-harness fit part of their engineering. That is a risk signal.
Demand the benchmark pair. Any quoted accuracy number must come with the harness it was measured under. “Claude Opus 4.6 at 79.8 percent on Terminal-Bench 2.0 with ForgeCode” is a claim. “Claude Opus 4.6 at 79.8 percent” is marketing. Treat the second as unfalsifiable until paired.
Audit your own harness. As we argued in Harness Audit: The Buyer-Side Governance Layer, most enterprises have already built half a harness without realizing it. The MCP servers, the context managers, the prompt templates your platform team layered on top of the vendor’s tool are part of the harness now. Your effective accuracy is whatever the model achieves inside your composite harness, not whatever the vendor measured. Run the benchmark on your stack. The number you get back is the number that matters.
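For teams that want to operationalize "run the benchmark on your stack," here is a bare skeleton of the comparison, offered as a sketch only. The `run_task` callables are placeholders you would wire to your own tooling; this is not the Terminal-Bench CLI, nor any vendor's API.

```python
# Skeleton for measuring "your" number: run the same task set through the vendor's
# harness and through your composite harness (vendor tool + your MCP servers, context
# manager, prompt templates), then compare. `run_task` is a placeholder you supply.

from statistics import mean
from typing import Callable

def pass_rate(run_task: Callable[[str], bool], task_ids: list[str]) -> float:
    """Fraction of benchmark tasks the system under test solves."""
    return mean(1.0 if run_task(t) else 0.0 for t in task_ids)

def harness_delta(vendor_run: Callable[[str], bool],
                  composite_run: Callable[[str], bool],
                  task_ids: list[str]) -> float:
    """Vendor-harness accuracy minus your-stack accuracy, in points.

    A positive delta is the cost of your wrappers. It belongs in the procurement
    file next to the vendor's quoted score, because it is the number you will live with.
    """
    return 100.0 * (pass_rate(vendor_run, task_ids) - pass_rate(composite_run, task_ids))
```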
The teams that win the next procurement cycle are not the ones who pick the best model. They are the ones who understand that the question “which model” was never the question. The question was always “which model, in which harness, post-trained how, evaluated where, deployed against what.” The Terminal-Bench 2.0 numbers are the first time the harness coefficient shows up cleanly enough that a non-technical executive can see it. Use them.
The vendors who post-trained for model-harness fit will publish the pair. The vendors who did not will publish only the model name. That asymmetry, on its own, is now a procurement signal.
This analysis draws on Nicholas Bustamante’s Model-Harness-Fit thread (cited via TLDR AI, May 2026).
Victorino Group helps procurement teams treat the harness as a first-class evaluation criterion alongside the model. Let’s talk.