The Harness Difference: When 42% Becomes 78% Without Changing the Model
Claude Opus 4.5 scored 42% on CORE-Bench. Then it scored 78%. Same model. Same benchmark. Same questions. The only variable that changed was the harness wrapping the model.
That is not a rounding error. It is an 85% relative improvement from scaffolding alone.
In Harness Engineering Is Not New, we covered the naming and history of this discipline. The argument was that the practices are older than the LLM era, even if the label is recent. This article is about something different: the accumulating empirical evidence that the harness determines performance more than the model does. Not as theory. As measured outcome.
The CORE-Bench Result
Sayash Kapoor and the Anthropic engineering team ran Claude Opus 4.5 through CORE-Bench, a benchmark for computational reproducibility in scientific research. The CORE-Agent scaffold produced 42%. Claude Code, Anthropic’s own harness, produced 78%.
Neither harness changed the model. Neither added training data. Neither modified weights. The difference was in how the harness managed context, structured tool calls, handled errors, and decided when to retry versus escalate. Scaffolding decisions.
Then they looked at the grading infrastructure. After fixing bugs in the evaluation harness itself, the score climbed to 95%. The model had been performing better than anyone measured, but the measurement tool was wrong.
Three numbers. 42%. 78%. 95%. The model contributed zero changes across all three. The harness produced a 36-point swing. The evaluation infrastructure produced another 17 points. If you are evaluating AI models by comparing benchmark scores, you are not comparing models. You are comparing harnesses.
The LangChain Experiment
CORE-Bench involved two different harnesses, which introduces confounding variables. LangChain’s Terminal Bench 2.0 ran a cleaner experiment.
They held the model fixed (GPT-5.2-Codex) and iterated on the harness alone. The starting score: 52.8%. The ending score: 66.5%. A 13.7-point improvement with zero model changes.
What did they change? Three things.
First, loop detection. Agents get stuck in repetitive cycles. The harness learned to detect when an agent was retrying the same failed approach and force a different strategy. This is not a model capability. It is a system capability. The model does not know it is looping. The harness does.
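The mechanism can be sketched in a few lines. This is a minimal illustration, not LangChain's implementation: the window size, the repetition threshold, and the class name are all invented here for clarity.

```python
from collections import deque


class LoopDetector:
    """Sketch of harness-side loop detection: flag an agent that keeps
    retrying the same failed action. Window and threshold are
    illustrative choices, not anyone's production parameters."""

    def __init__(self, window: int = 6, threshold: int = 3):
        self.recent = deque(maxlen=window)  # rolling history of actions
        self.threshold = threshold

    def record(self, action: str, succeeded: bool) -> bool:
        """Record an action; return True if the agent appears stuck."""
        self.recent.append((action, succeeded))
        # Count identical failed attempts within the recent window.
        failures = sum(1 for a, ok in self.recent if a == action and not ok)
        return failures >= self.threshold


detector = LoopDetector()
stuck = False
for _ in range(3):
    stuck = detector.record("pip install missing-pkg", succeeded=False)
# The third identical failure trips the detector; the harness can now
# force a different strategy instead of letting the loop continue.
```

The point of the sketch: the signal lives entirely outside the model. The harness sees the history; the model, turn by turn, does not.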
Second, local context middleware. Instead of dumping the full terminal output into the context window, the harness selectively surfaced relevant portions. Less noise, better reasoning. Again, not a model change. A plumbing change.
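A toy version of that filtering, to make the idea concrete. The keyword list, the tail size, and the function name are assumptions for illustration; real middleware would be considerably more sophisticated.

```python
def surface_relevant(output: str,
                     keywords=("error", "fail", "warning"),
                     tail: int = 5) -> str:
    """Sketch of context middleware: instead of forwarding full terminal
    output to the model, keep only lines matching diagnostic keywords
    plus the final few lines, preserving original order."""
    lines = output.splitlines()
    # Lines that look diagnostically relevant.
    keep_idx = {i for i, line in enumerate(lines)
                if any(k in line.lower() for k in keywords)}
    # Always keep the tail: commands usually end with the decisive lines.
    keep_idx.update(range(max(0, len(lines) - tail), len(lines)))
    return "\n".join(lines[i] for i in sorted(keep_idx))


log = "\n".join([f"step {i}" for i in range(50)] + ["ERROR: build failed"])
trimmed = surface_relevant(log)
# 51 lines of output collapse to 5: the error line plus the tail.
```

Fifty lines of noise become five lines of signal before the model ever sees them. Less noise, better reasoning.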
Third, compute budget tuning. This one is counterintuitive. Setting maximum reasoning tokens too high actually hurt performance. The model would reason itself into timeouts on problems it could have solved faster with less deliberation. The optimal compute budget was lower than the maximum available.
That last finding deserves emphasis. More thinking time made the model worse. The harness team had to constrain the model to improve its output. This runs counter to the assumption that giving models more resources always helps.
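The tuning itself is ordinary empirical work: sweep the cap, measure solve rate, pick the argmax. The numbers below are made up to illustrate the shape of the finding, not LangChain's actual sweep.

```python
def best_budget(results: dict[int, float]) -> int:
    """Pick the reasoning-token cap with the highest observed solve
    rate. The optimum is whatever the sweep says, not the maximum."""
    return max(results, key=results.get)


# Hypothetical sweep: solve rate by maximum reasoning tokens.
# Illustrates the shape of the finding: performance peaks below the
# largest available budget, because over-deliberation causes timeouts.
sweep = {2_000: 0.55, 8_000: 0.66, 32_000: 0.61}
optimum = best_budget(sweep)  # 8_000, not 32_000
```

The code is trivial. The discipline of actually running the sweep, instead of assuming more tokens means better answers, is the harness engineering.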
Vercel’s Subtraction
Vercel’s v0 agent went in the opposite direction from what you might expect. They reduced their tool count from 15 to 2. Success rate went from 80% to 100%. Speed improved 3.5x, from 274 seconds to 77 seconds. Token usage dropped 37%. Total steps dropped 42%.
Their engineering team summarized it bluntly: “We were solving problems the model could handle on its own.”
This is the “less is more” finding that keeps appearing across agent engineering. Tool abundance creates decision overhead. The model spends tokens choosing between tools instead of solving the problem. Fewer tools means fewer wrong choices, faster execution, and less surface area for failure.
As we examined in Stripe’s Agentic Layer, Stripe reached a similar conclusion with their meta-tool pattern. Five hundred tools available, but the agent sees only 15 or 20 at any given step. Stripe added a selection layer. Vercel just deleted the tools. Different solutions, same underlying insight: the model performs better when you reduce what it has to think about.
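A selection layer in the spirit of Stripe's pattern can be sketched simply. The scoring here is naive keyword overlap, purely for illustration; the registry contents and function name are invented, and a production system would use embeddings or learned relevance.

```python
def select_tools(task: str, registry: dict[str, str], k: int = 15):
    """Sketch of a meta-tool selection layer: the full registry stays
    available, but the model only sees the k tools whose descriptions
    best match the current step."""
    words = set(task.lower().split())
    # Rank tools by word overlap between task and tool description.
    scored = sorted(
        registry.items(),
        key=lambda kv: len(words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Illustrative three-tool registry standing in for a 500-tool one.
registry = {
    "create_invoice": "create a new invoice for a customer",
    "refund_payment": "refund a payment to a customer",
    "list_webhooks": "list configured webhook endpoints",
}
visible = select_tools("refund the customer payment", registry, k=2)
# refund_payment ranks first; list_webhooks never reaches the model.
```

Whether you delete tools (Vercel) or gate them (Stripe), the model's decision surface shrinks, and that is what moves the numbers.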
SWE-bench Pro and the Context Retrieval Problem
SWE-bench Pro tested three products built on the same model (Opus 4.5) against 731 coding problems. Auggie scored 51.80%. Cursor scored 50.21%. Claude Code scored 49.75%.
The absolute spread is small. But it represents 15 to 17 problems where one harness solved what another could not. On identical model capabilities.
The primary differentiator was context retrieval. How does the harness find the relevant code in a large repository? Keyword matching? Semantic search? AST-aware traversal? Cursor, for example, uses Merkle tree content proofs for governed context access. Their indexing pipeline reduced p99 times from 4.03 hours to 21 seconds while achieving 92% organizational similarity in retrieved context.
The Merkle tree itself is not novel. Git uses the same data structure. IPFS does too. What Cursor did is apply it to context governance: a cryptographic proof that the agent is seeing the right files, not just files that match a keyword. That application to agent context selection is the contribution.
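The underlying structure is small enough to sketch. This is a generic Merkle root over content chunks, the same shape Git and IPFS use, not Cursor's implementation; the duplication rule for odd levels is one common convention among several.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(chunks: list[bytes]) -> bytes:
    """Sketch of a Merkle root over file contents. A harness can compare
    this root against the repository's to prove the retrieved context
    matches the files on disk, not just files matching a keyword."""
    level = [_h(c) for c in chunks] or [_h(b"")]
    while len(level) > 1:
        if len(level) % 2:              # duplicate last node on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


root = merkle_root([b"src/app.py contents", b"src/db.py contents"])
# Any edit to any chunk changes the root, so a stale or wrong context
# set is detectable with a single 32-byte comparison.
```

One hash comparison answers "is the agent seeing the real repository?" That is the governance application.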
Stripe’s Evaluation Architecture
In our coverage of Stripe’s agentic layer, we focused on their production architecture: the Blueprint Engine, the Tool Shed, the DevBox infrastructure. Their benchmark work tells a different story.
Stripe ran an internal benchmark of 4 full-stack tasks. Opus 4.5 scored 92%. GPT-5.2 scored 73% on backend tasks. The average task required 63 turns of agent interaction.
Those numbers require context. Four tasks is a tiny sample. The tasks were self-selected by the team that built the system. The grading was self-assessed. Call it what it is: a team measuring its own work with its own ruler.
But the interesting contribution is not the score. It is the methodology. Stripe built full-stack evaluation tasks that test end-to-end integration, not isolated function completion. Their observation that “a mostly correct integration is a failure” reflects production reality better than most benchmarks. A payment flow that works 95% of the time is broken. Stripe’s evaluation infrastructure is designed to catch that remaining 5%.
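The grading philosophy reduces to one line: all-or-nothing. A minimal sketch, with check names invented here rather than taken from Stripe's evaluation suite:

```python
def grade_integration(checks: dict[str, bool]) -> bool:
    """Sketch of all-or-nothing grading: an end-to-end task passes only
    if every integration check passes. Partial credit is failure."""
    return all(checks.values())


# Hypothetical end-to-end run: two of three checks pass.
run = {
    "charge_created": True,
    "webhook_delivered": True,
    "refund_flow": False,
}
passed = grade_integration(run)  # False: 2/3 correct still fails
```

Most benchmarks would score that run at 67%. Stripe's philosophy scores it at zero, which is exactly how production would score it.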
The benchmark improved their product. That is the real output. Not the score itself, but the process of building evaluations rigorous enough to expose harness weaknesses.
Where the Harness Stops Mattering
APEX-Agents exists to remind us that this story has limits.
APEX tests frontier-difficulty tasks. The best score anyone has achieved is 24%, regardless of harness sophistication. No amount of scaffolding, tool design, or context engineering has pushed past that ceiling.
For frontier-complexity work, the model is still the bottleneck. Harness engineering cannot compensate for capability the model does not have. The 42-to-78 story applies to problems within the model’s capability range where scaffolding determines how much of that capability gets expressed. Beyond that range, better harnesses produce the same failures with better logging.
This distinction matters for investment decisions. If your tasks are in the middle-complexity band where current models can solve them given proper support, invest in the harness. If your tasks are at the frontier, you are waiting for the next model generation. Model-generation jumps (GPT-4 to GPT-5, for example) deliver 20 to 40 point improvements. Harness improvements deliver 10 to 15. Both matter. Neither replaces the other.
The Commercial Interest Problem
Every data point in this article comes from a company selling something.
Anthropic published the CORE-Bench result. They sell Claude. LangChain published Terminal Bench 2.0. They sell LangGraph and LangSmith. Vercel published the v0 results. They sell v0. Stripe published their benchmark. They are hiring. Cursor published their indexing metrics. They sell Cursor.
None of this invalidates the findings. But it should calibrate confidence. These organizations selected which results to publish. They chose benchmarks that showcase their products. The harness improvements are real, but the magnitude may be optimistic. We are seeing the highlight reel, not the blooper reel.
The strongest data point is LangChain’s Terminal Bench, precisely because it is the most controlled. One model, iterated harness, public methodology. If you want to believe one number in this article, believe 13.7 points.
The Transitional Question
There is a real tension in the “invest in harnesses” argument.
Manus, the AI agent startup, refactored their harness five times in six months. Not because the harness was bad. Because the underlying models improved and the harness had to adapt. Heavy frameworks built around model limitations become liabilities when the model sheds those limitations.
Loop detection middleware is valuable today because models get stuck in loops. If the next model generation handles loops natively, that middleware becomes dead code. Context windowing is critical because context windows are finite. If context windows grow 10x, the windowing logic needs rewriting.
This does not mean harness engineering is wasted effort. It means harness engineering is software engineering, with all the maintenance and evolution that implies. The harness is not a one-time build. It is a living system that co-evolves with the models it wraps.
The organizations that treat the harness as infrastructure (maintained, versioned, tested) will adapt. The ones that treat it as a fixed artifact will find themselves refactoring from scratch every six months. Manus learned this the expensive way.
What the Data Says
Five independent teams. Different models, different benchmarks, different commercial incentives. All converging on the same conclusion.
The model is not the product. The harness is the product.
42% becomes 78% from scaffolding. 52.8% becomes 66.5% from loop detection, context middleware, and compute budget tuning. 80% becomes 100% from removing tools. p99 indexing drops from 4 hours to 21 seconds from better retrieval architecture.
For tasks within the model’s capability range, the harness determines somewhere between 30% and 85% of the observed performance. The model provides the ceiling. The harness determines how close you get to it.
If you are evaluating AI products by comparing model names, you are measuring the wrong variable. If you are building AI products by upgrading models without upgrading harnesses, you are leaving the largest performance lever untouched.
The model does the work. The harness decides how well.
This analysis synthesizes Sayash Kapoor’s CORE-Bench analysis (Anthropic, February 2026), LangChain Terminal Bench 2.0 (LangChain, February 2026), Vercel v0 engineering blog (February 2026), Stripe Minions Part 1 and Part 2 (February 2026), and SWE-bench Pro results (March 2026).
Victorino Group builds the harness and governance layer that turns model capability into production performance. Let’s talk.