Harness Engineering Is Subtraction: Anthropic's Own Talk Shows the Scaffolding Shrinking
In March 2026 we wrote about Anthropic’s generator-evaluator harness: the sprint contracts, the context resets, the three-role pattern. That post, Generator-Evaluator Loops, is required reading for this one. What follows assumes you have read it.
In May 2026, at the AI Engineer Conference, Ash Prabaker and Andrew Wilson of Anthropic’s applied AI team walked through what their harness looks like two months later. The headline is not what they added. It is what they removed.
Between Opus 4.5 and Opus 4.6, three things in their own harness got deleted. Forced sprint decomposition, gone. Fresh context windows per sprint, gone. Per-sprint evaluator runs, gone. The hours-long agent did not get worse. By their internal METR-style benchmark, Sonnet 3.7 in February 2025 sustained roughly one hour of coherent agent work under a minimal harness. Opus 4.6 in early 2026 sustains roughly twelve hours under the same minimal baseline. Twelve times the runtime, fewer moving parts in the scaffolding.
That is the discipline this post is about. Harness engineering is subtraction. Most teams are still adding.
The Curve the Model Is Climbing
The deletions only make sense once you see the shape of the curve. Wilson presented the timeline as a series of paired releases, where each new model arrived alongside a harness primitive that the previous model could not have supported.
Sonnet 3.5 brought artifacts and computer use. Sonnet 3.7 brought the Claude Code research preview. Opus 4 and Sonnet 4 turned Claude Code into a generally available product with an SDK. Sonnet 4.5 added context-window awareness, Claude Code 2.0 shipped with checkpoints, and the SDK was renamed the Agent SDK. Opus 4.5 introduced many-sub-agent orchestration, with the model positioned as planning-grade. Haiku 4.5 paired with Opus 4.5 to make multi-sub-agent runs economically viable. Opus 4.6 and Sonnet 4.6 brought server-side compaction, 1M-token context as a generally available feature, and the agent-teams primitive.
Each step did two things. It made the model more capable of holding state and intent over longer horizons. It also moved capabilities that used to live in the harness down into the model or the platform.
That second move is the one that changes how you build. When server-side compaction handles context maintenance, your harness no longer needs to schedule context resets. When the model can hold two-hour continuous builds without losing the plot, your planner no longer needs to force a sprint decomposition that exists only to keep the model from drifting. The scaffolding was load-bearing for an earlier model. It is now in the way.
What Anthropic Deleted, and Why
Prabaker was specific about which pieces of the March harness no longer survive in May.
The first deletion was forced sprint decomposition. In the original generator-evaluator design, the planner broke work into bounded sprints because the generator could not maintain coherence across a longer arc. Opus 4.6 can. The team now allows continuous builds of two hours or more without artificially cutting the model’s working session into pieces.
The second deletion was the fresh-context-per-sprint pattern. The original harness archived accumulated context at each sprint boundary and restarted the generator with a clean window plus the sprint contract. Server-side compaction, which arrived with the 4.6 generation, does the equivalent job without requiring the harness to drive it. The orchestration code that managed those resets is gone.
The third deletion was the per-sprint evaluator run. In March, the evaluator ran at every sprint boundary, validating against the sprint contract before the generator was allowed to proceed. The current harness runs the evaluator once, at the end of a one-shot generation. The generator produces a complete artifact against a negotiated contract. The evaluator grades it once.
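The change in control flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's code: `generate` and `evaluate` are hypothetical callables standing in for model calls, and the function names are ours.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for model calls; none of these names come from the talk.
Generate = Callable[[str], str]        # task description -> artifact
Evaluate = Callable[[str, str], bool]  # (contract, artifact) -> pass/fail


def run_per_sprint(sprints: List[str], generate: Generate,
                   evaluate: Evaluate) -> Tuple[List[str], int]:
    """Earlier pattern: the evaluator gates every sprint boundary."""
    artifacts: List[str] = []
    evaluator_calls = 0
    for sprint_contract in sprints:
        artifact = generate(sprint_contract)
        evaluator_calls += 1
        if not evaluate(sprint_contract, artifact):
            raise RuntimeError(f"sprint failed its contract: {sprint_contract!r}")
        artifacts.append(artifact)
    return artifacts, evaluator_calls


def run_one_shot(contract: str, generate: Generate,
                 evaluate: Evaluate) -> Tuple[str, bool, int]:
    """Current pattern: one continuous generation, graded once at the end."""
    artifact = generate(contract)
    passed = evaluate(contract, artifact)
    return artifact, passed, 1


if __name__ == "__main__":
    # Toy callables so the sketch runs without a model.
    toy_generate = lambda task: f"built: {task}"
    toy_evaluate = lambda contract, artifact: contract in artifact

    _, calls = run_per_sprint(["menu", "level editor", "sound"],
                              toy_generate, toy_evaluate)
    _, ok, one_shot_calls = run_one_shot("menu, level editor, sound",
                                         toy_generate, toy_evaluate)
    print(calls, one_shot_calls, ok)
```

The second function is shorter because the loop, the gate, and the per-sprint bookkeeping are gone. What remains is the contract and a single grading pass.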
Each of these deletions removed code, removed cost, removed orchestration complexity, and did not regress quality. That is the test for any harness primitive. If the latest model can absorb the scaffold’s job, the scaffold has earned its way out.
What Survives, and What That Tells You
Three primitives did not get deleted. They are the ones worth understanding, because the absence of deletion is itself a signal.
The planner-generator-evaluator role separation is still there. It is a critic-reviewer pattern with explicit role contracts, not a GAN, an analogy the prior post already corrected. The roles persist because the bias of any model grading its own output persists. A model evaluating its own work still misses the same categories of error that produced the work. The fix is structural separation, not better self-reflection.
The file system as shared state survived. Agents read and write to disk. Disk is the protocol. The team did not move to a richer state-passing abstraction, for the same reason filesystems beat custom storage in most engineering contexts: you can list it, grep it, audit it, and run it through any other tool. The harness primitive that wins is usually the one that imposes the least new vocabulary.
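A rough sketch of what "disk is the protocol" can look like in practice follows. The file names (`contract.json`, `artifact.json`, `evaluation.json`) and the JSON layout are assumptions made for illustration; the talk does not specify them.

```python
import json
import tempfile
from pathlib import Path


def write_state(workdir: Path, name: str, payload: dict) -> Path:
    """Persist one role's output as a plain JSON file."""
    path = workdir / name
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def read_state(workdir: Path, name: str) -> dict:
    """Load another role's output from disk."""
    return json.loads((workdir / name).read_text(encoding="utf-8"))


def demo() -> list:
    """Three roles hand work to each other through files only."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)

        # Planner: writes the contract (hypothetical criteria).
        write_state(workdir, "contract.json",
                    {"criteria": ["page loads", "score is displayed"]})

        # Generator: reads the contract, writes what it claims to have built.
        contract = read_state(workdir, "contract.json")
        write_state(workdir, "artifact.json",
                    {"addressed": contract["criteria"]})

        # Evaluator: reads both files, writes its findings.
        artifact = read_state(workdir, "artifact.json")
        missing = [c for c in contract["criteria"]
                   if c not in artifact["addressed"]]
        write_state(workdir, "evaluation.json", {"missing": missing})

        # Anything that can list a directory can audit the run.
        return sorted(p.name for p in workdir.iterdir())


if __name__ == "__main__":
    print(demo())
```

No role holds a reference to another role's in-memory state, so every intermediate result stays inspectable with ordinary tools.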
Contract negotiation between generator and evaluator survived. In their Retro Forge example, the contract had 27 explicit criteria, the run cost about 200 US dollars, and it ran for six hours. The contract was negotiated before any code was generated. This is the artifact that does the heavy lifting. The contract is what the generator builds against and what the evaluator scores against. Without it, you are back to vibes.
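One way to make the contract a concrete object is sketched below. The `Criterion` and `Contract` types, the `grade` function, and the three sample criteria are all hypothetical; the talk does not publish the contract schema or the 27 criteria themselves.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    id: str
    description: str


@dataclass(frozen=True)
class Contract:
    criteria: Tuple[Criterion, ...]


def grade(contract: Contract, verdicts: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Score an artifact against the contract.

    `verdicts` maps criterion id -> pass/fail as decided by the evaluator.
    Returns the fraction of criteria passed and the ids that failed.
    """
    if not contract.criteria:
        raise ValueError("a contract needs at least one criterion")
    skipped = [c.id for c in contract.criteria if c.id not in verdicts]
    if skipped:
        # The evaluator must rule on every negotiated criterion.
        raise ValueError(f"evaluator skipped criteria: {skipped}")
    failed = [c.id for c in contract.criteria if not verdicts[c.id]]
    passed = len(contract.criteria) - len(failed)
    return passed / len(contract.criteria), failed


if __name__ == "__main__":
    # Illustrative criteria only, not the ones from the talk.
    contract = Contract(criteria=(
        Criterion("C1", "Game starts from the title screen"),
        Criterion("C2", "High scores persist across sessions"),
        Criterion("C3", "Keyboard controls are remappable"),
    ))
    score, failed = grade(contract, {"C1": True, "C2": False, "C3": True})
    print(round(score, 2), failed)
```

Because the criteria are fixed before generation starts, the evaluator cannot quietly grade against a different standard than the one the generator built to.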
There is a quieter primitive that also survived and deserves a paragraph of its own. The evaluator uses an explicit rubric to grade subjective qualities. Anthropic’s example uses a 4-axis rubric: design, originality, craft, functionality. The rubric is calibrated against reference sites. Subjective does not mean ungradable. It means the grading scheme has to be made explicit and external. The rubric is the load-bearing object, not the model’s taste.
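The sketch below shows one possible reading of an explicit, calibrated rubric. Only the four axis names come from the talk. The 1 to 5 scale and the idea of measuring calibration as the mean absolute gap against human-assigned reference grades are our assumptions.

```python
from statistics import mean
from typing import Dict

# Axis names as described in the talk; everything else here is assumed.
AXES = ("design", "originality", "craft", "functionality")


def validate_scores(scores: Dict[str, int], low: int = 1, high: int = 5) -> None:
    """Reject a grade that omits an axis, adds one, or leaves the scale."""
    if set(scores) != set(AXES):
        raise ValueError(f"scores must cover exactly these axes: {AXES}")
    for axis, value in scores.items():
        if not low <= value <= high:
            raise ValueError(f"{axis} score {value} is outside {low}..{high}")


def calibration_error(evaluator_scores: Dict[str, int],
                      reference_scores: Dict[str, int]) -> float:
    """Mean absolute gap per axis between evaluator and reference grades."""
    validate_scores(evaluator_scores)
    validate_scores(reference_scores)
    return mean(abs(evaluator_scores[a] - reference_scores[a]) for a in AXES)


if __name__ == "__main__":
    # Hypothetical grades for one reference site.
    evaluator = {"design": 4, "originality": 3, "craft": 4, "functionality": 5}
    reference = {"design": 4, "originality": 2, "craft": 4, "functionality": 5}
    print(calibration_error(evaluator, reference))
```

A persistent gap on one axis points at the rubric wording for that axis, which is something a person can read and fix.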
The Debugging Loop Nobody Wants to Hear About
Prabaker said something on stage that contradicts most of the literature on agent observability. The primary debugging loop for the harness builder is reading agent traces by hand.
Not dashboards. Not automated trace-analysis pipelines. Not LLM judges grading their own runs. A person sits down, opens the trace, and reads what the agent did and why. The team explicitly rejected the idea of fully automated trace analysis as the primary loop, because automated systems have the same bias as the agents they grade. They miss the same things.
This is uncomfortable advice because it does not scale with headcount. You cannot hire a hundred trace readers and call it production. The point of saying it out loud is to set the expectation correctly. Traces are how you understand the system. You can build telemetry on top of trace reading, but you cannot skip it. Teams that try to skip directly to automated trace summarization end up with a confident dashboard sitting on top of a misunderstood system.
The practical implication for engineering leadership is simple. Budget for a small number of trace readers on every team operating long-running agents in production. Make them senior. Make it part of the on-call rotation. The trace is the truth, and someone has to keep reading it.
The Subtraction Discipline
The thesis of this post lives in one sentence. Every harness primitive you ship has an expiration date, and your job as a harness engineer is to delete it before it becomes a tax on the next model generation.
Most teams do not work this way. They add. The harness grows new layers of orchestration, new sub-agent roles, new context-shaping middleware, and the additions stay forever. The team that wrote them is reluctant to remove them because they shipped them, the on-call runbook references them, and the regression tests pass with them in place. Meanwhile, the model has absorbed half of what they do.
The Anthropic team has organizational permission to delete its own code because deletion is part of how they evaluate their own harness. That permission is not exotic. Any platform team can grant it. The mechanism is a quarterly audit. Every quarter, take the current harness, list every primitive in it, and ask whether the current model still needs it. If the answer is “no” or “I am not sure,” run the harness without that primitive on your benchmark suite and compare. If quality holds, the primitive goes.
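The audit described above amounts to a one-at-a-time ablation, which can be sketched as follows. The `run_benchmark` callable, the primitive names, and the `tolerance` threshold are hypothetical; substitute your own evaluation suite and your own definition of "quality holds".

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Hypothetical: takes the set of enabled primitives, returns a quality score.
Benchmark = Callable[[FrozenSet[str]], float]


@dataclass
class AuditResult:
    primitive: str
    baseline: float
    without: float
    verdict: str  # "delete" or "keep"


def audit_harness(primitives: List[str], run_benchmark: Benchmark,
                  tolerance: float = 0.01) -> List[AuditResult]:
    """Disable each primitive in turn and compare against the full harness."""
    enabled = frozenset(primitives)
    baseline = run_benchmark(enabled)
    results = []
    for primitive in primitives:
        score = run_benchmark(enabled - {primitive})
        # Quality "holds" if the score stays within tolerance of baseline.
        verdict = "delete" if score >= baseline - tolerance else "keep"
        results.append(AuditResult(primitive, baseline, score, verdict))
    return results


if __name__ == "__main__":
    # Toy benchmark: only role separation still affects quality.
    def toy_benchmark(enabled: FrozenSet[str]) -> float:
        return 0.9 if "role_separation" in enabled else 0.6

    for result in audit_harness(
            ["context_reset", "forced_sprints", "role_separation"],
            toy_benchmark):
        print(result.primitive, result.verdict)
```

This ablates one primitive at a time, so it will not catch two primitives that are each redundant alone but load-bearing together. Re-run the suite after each actual deletion before removing the next one.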
The audit is the loop. The model improves; the harness shrinks; the audit catches what the model absorbed; the shrunk harness frees engineering attention for the next frontier task that needs new scaffolding. The total investment in harness engineering does not decrease, but the location of that investment moves forward with the frontier.
Do This Now
Pick the harness around one production agent in your stack. Open the orchestration code. Find one primitive that was load-bearing when you wrote it: a context reset, a forced decomposition, an evaluator gate, a planner step that the current model could plausibly skip.
Run your evaluation suite with that primitive removed. If quality holds, delete it. Keep the deletion in a separate commit so you can revert if a future regression surfaces. Do this once a quarter for every long-running agent you operate.
If the deletion regresses quality, you have learned something useful: that primitive is still load-bearing for your specific workload, and the next model generation is where it earns its way out. Mark it, watch it, and re-audit when the next major model lands.
This is the discipline. Add when the frontier task demands it. Delete when the model has absorbed it. For a fixed workload, the harness that does its job correctly is smaller next quarter than it was this one.
This analysis synthesizes Build Agents That Run for Hours (Ash Prabaker and Andrew Wilson, Anthropic, AI Engineer Conference 2026), How we built our multi-agent research system (Anthropic Engineering, 2025), and the earlier Victorino analysis Generator-Evaluator Loops.
Victorino Group helps engineering teams audit their agent harness for scaffolding that the latest model has already absorbed.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.