The Harness Is a 90-Day Artifact: Why Both Ends of the Stack Are Eating It

Thiago Victorino

The agent harness your platform team is hardening this quarter has a useful life measured in model generations, not years. Two pieces of evidence published this month make that uncomfortable to ignore.

Han Lee’s “Hidden Technical Debt of AI Systems: Agent Harness” reports benchmark numbers that should reorder how you think about harness investment. On the same GPT-5.1 Codex generation, the first-party harness scored 20.2% on posttrain evaluations, while third-party harnesses scored 7.7%. That is a 2.6x gap on the same model. Letta Code on Opus 4.5 reaches 59.1% against Claude Code’s 41.6% on memory-heavy evaluations. Flip to Gemini 3 and the picture changes again: Letta 56.0%, provider 58.4%. The harness advantage narrows as the model matures.

Drew Breunig published the other half of the picture three days later. In “The Cost of Overfitting the Harness,” he traces a quieter pattern: frontier labs are absorbing harness behaviors into the model itself. OpenAI wound down fine-tuning APIs in the same window where it baked previously external scaffolding into base models. Mario Zechner documented his GPTs degrading on workflows that used to work, which is the symptom of behaviors moving from the harness layer into the model’s training distribution. Breunig calls the result the “naked robotic core”: a model that no longer needs the scaffolding because the scaffolding became part of the model.

Both writers, working independently, arrive at the same conclusion from opposite ends. The harness is being eaten. From one end by the frontier labs that bake behaviors into the model. From the other end by third-party builders like Letta whose first-party-quality harnesses prove the scaffolding still pays off, but only for the current model generation. The agent harness you ship today is a 2026 artifact, not a platform.

The implication is not “skip the harness.” The implication is “design the harness as throwaway scaffolding with thin, swappable interfaces.”

The Compression Wave

Lee’s benchmark table is the cleanest piece of evidence I have seen for what was previously a hunch. Same model, different harness, roughly 2.6 times the score. The arXiv paper Lee cites (2603.08640v1) measured first-party Codex against third-party harnesses on identical posttrain tasks. The 20.2% to 7.7% spread is not noise. The first-party harness knows the model’s training distribution. The third-party harness is reverse-engineering it.

Letta’s data shows the same effect from the other side. A small team that builds a memory-native harness can outperform a frontier lab’s first-party offering on memory-heavy tasks, by 17.5 percentage points on Opus 4.5 (59.1% against 41.6%). The team that knows agent-specific architecture beats the team that knows the model.

Now layer the Gemini 3 result on top. Letta 56.0%, provider 58.4%. The harness advantage that was decisive on Opus 4.5 has compressed to a 2.4-point disadvantage on a newer model generation. Read across the table and a pattern emerges: the more mature the model, the less the harness matters. The first generation of any new model rewards heavy scaffolding. The third generation rewards thin scaffolding. The fifth generation absorbs the scaffolding into itself.
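The arithmetic behind these comparisons is easy to check. The sketch below uses only the scores quoted in this article; the dictionary keys are labels chosen for illustration, not identifiers from either source.

```python
# Scores quoted in this article, in percent. No new data is introduced here.
scores = {
    "gpt-5.1-codex": {"first_party": 20.2, "third_party": 7.7},
    "opus-4.5": {"letta": 59.1, "claude_code": 41.6},
    "gemini-3": {"letta": 56.0, "provider": 58.4},
}

# Ratio of first-party to third-party harness on the same Codex generation.
codex = scores["gpt-5.1-codex"]
codex_ratio = codex["first_party"] / codex["third_party"]

# Signed gaps in percentage points: positive means the Letta harness leads.
opus_gap = scores["opus-4.5"]["letta"] - scores["opus-4.5"]["claude_code"]
gemini_gap = scores["gemini-3"]["letta"] - scores["gemini-3"]["provider"]

print(f"Codex first-party / third-party: {codex_ratio:.1f}x")       # 2.6x
print(f"Opus 4.5, Letta minus Claude Code: {opus_gap:+.1f} points")  # +17.5 points
print(f"Gemini 3, Letta minus provider: {gemini_gap:+.1f} points")   # -2.4 points
```

The sign flip between the last two lines is the compression described above: a 17.5-point lead on one model becomes a 2.4-point deficit on the other.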

Breunig’s framing makes this concrete. The frontier labs are not competing with harness builders by accident. They are running the same playbook OpenAI ran with fine-tuning: ship the capability externally, watch how customers use it, bake the patterns into the model, retire the external surface. Fine-tuning is gone, or close to it. Custom GPTs degraded because their workflows were absorbed. The harness is next.

What “Throwaway” Actually Means

The mistake teams make is reading “throwaway” as “do not invest.” That is wrong. The right reading is “invest where the investment compounds, not where it ossifies.”

Three things compound across harness generations: governance policy, evaluation harnesses, and identity. Three things do not: prompt scaffolding, tool wrappers, and memory schemas. The first list survives a model swap. The second list gets rewritten on the next major release.

This is why the four-floor containment building survives every harness rotation. Compute isolation, data containment, knowledge governance, and identity federation are not harness features. They are platform features. The harness sits inside the building. When the harness is replaced, the building stands.

Treat that distinction as the architectural commitment. Anything that lives inside the building (prompt templates, tool definitions, memory schemas, retry policy) can be swapped on a 90-day cadence with the next model release. Anything that is the building (IAM federation, audit logs, sandboxing primitives, data policy) must outlive the model.
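One way to make that commitment visible in code is to keep the two kinds of configuration in separate types, so a harness rotation cannot touch policy by accident. This is a minimal sketch, not a prescribed design: every class name, field name, and value below is hypothetical and chosen only to mirror the two lists above.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PlatformPolicy:
    """The building: expected to outlive any single model generation."""
    identity_provider: str
    audit_log_sink: str
    sandbox_profile: str
    data_policy: str


@dataclass(frozen=True)
class HarnessConfig:
    """The contents: expected to be rewritten with each model release."""
    model: str
    prompt_template: str
    tool_names: Tuple[str, ...]
    memory_schema: str
    max_retries: int


@dataclass(frozen=True)
class Deployment:
    policy: PlatformPolicy
    harness: HarnessConfig


def rotate_harness(current: Deployment, new_harness: HarnessConfig) -> Deployment:
    """Swap the harness; the policy object is carried over unchanged."""
    return replace(current, harness=new_harness)


# Illustrative values only.
policy = PlatformPolicy("corp-sso", "audit-bucket", "no-network", "pii-redacted")
old = Deployment(policy, HarnessConfig("model-a", "v1 prompt", ("search",), "flat", 3))
new = rotate_harness(old, HarnessConfig("model-b", "v2 prompt", ("search", "code"), "graph", 1))

print(new.policy is old.policy)  # True: the building did not move
print(new.harness.model)         # model-b
```

Because both dataclasses are frozen, the only way to change policy is to construct a new `PlatformPolicy` explicitly, which is the kind of change a governance review should see.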

The Split-Harness Pattern

There is a way to operationalize “throwaway with thin interfaces”: split the harness into three roles and let each evolve at its own cadence.

The production harness is the smallest possible surface. It enforces governance policy, isolates compute, gates data access, and federates identity. It does not know which model it is talking to. It speaks a thin contract: take this input, return this output, log everything in between. When a new model ships, the production harness does not change. The model swap is a configuration change.
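The thin contract can be sketched in a few lines. The example below assumes a model is any function from string to string and that policy is a single yes/no check; `ProductionHarness`, `is_allowed`, and the toy models are hypothetical names for illustration, not an API from Lee, Breunig, or Letta.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a model is anything that maps an input string to an output string.
ModelFn = Callable[[str], str]


@dataclass
class ProductionHarness:
    """Model-agnostic wrapper: check policy, call the model, log everything."""
    models: Dict[str, ModelFn]               # registry of available backends
    active_model: str                        # the only thing a model swap changes
    is_allowed: Callable[[str, str], bool]   # (user_id, text) -> policy decision
    audit_log: List[str] = field(default_factory=list)

    def run(self, user_id: str, text: str) -> str:
        allowed = self.is_allowed(user_id, text)
        # The model is only called when policy allows the request.
        output = self.models[self.active_model](text) if allowed else ""
        # Every request is logged, including denied ones.
        self.audit_log.append(json.dumps({
            "ts": time.time(), "user": user_id, "model": self.active_model,
            "allowed": allowed, "input": text, "output": output,
        }))
        if not allowed:
            raise PermissionError(f"policy denied request from {user_id}")
        return output


# Toy stand-ins for two model generations.
harness = ProductionHarness(
    models={"model-a": lambda s: s.upper(), "model-b": lambda s: s[::-1]},
    active_model="model-a",
    is_allowed=lambda user, text: user != "blocked",
)
print(harness.run("alice", "hello"))  # HELLO
harness.active_model = "model-b"      # the model swap is a configuration change
print(harness.run("alice", "hello"))  # olleh
```

Nothing in `run` depends on which model is active, so the policy check and the audit record have the same shape before and after the swap.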

The training harness is where exploration lives. It is fat, opinionated, and disposable. It is the place where you wrap tools, design prompts, structure memory, and experiment with the agentic patterns the model rewards. The training harness is rewritten every model generation. That is the point. It is the place where you discover what the model actually does well, before that knowledge gets baked into the next model generation and you have to rewrite again.

The eval harness is the bridge. It runs the same task suite across model versions, harness versions, and configurations. It is the only piece that survives every rotation, because it has to. The eval harness is how you decide when to swap the training harness, when to swap the model, and when the absorption pattern Breunig describes has reached the behavior you used to depend on.
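A minimal sketch of that bridge: score every model and harness combination on one fixed task suite. The models, harnesses, and tasks below are toy stand-ins invented for illustration, and exact-match scoring is an assumption; a real suite would use whatever grading your workload needs.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str]                 # (prompt, expected answer)
Model = Callable[[str], str]           # raw model
Agent = Callable[[str], str]           # model wrapped by a harness
Harness = Callable[[Model], Agent]     # a harness turns a model into an agent


def run_suite(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks where the agent's answer matches exactly."""
    passed = sum(1 for prompt, expected in tasks if agent(prompt) == expected)
    return passed / len(tasks)


def eval_matrix(
    models: Dict[str, Model], harnesses: Dict[str, Harness], tasks: List[Task]
) -> Dict[Tuple[str, str], float]:
    """Score every (model, harness) pair on the same task suite."""
    return {
        (m_name, h_name): run_suite(harness(model), tasks)
        for (m_name, model), (h_name, harness) in product(models.items(), harnesses.items())
    }


# Toy setup: the "new" model has absorbed the lowercasing the fat harness adds.
models = {"old": lambda p: p.strip(), "new": lambda p: p.strip().lower()}
harnesses = {
    "thin": lambda model: model,
    "fat": lambda model: (lambda p: model(p).lower()),
}
tasks = [(" Paris ", "paris"), ("rome", "rome")]

for key, score in eval_matrix(models, harnesses, tasks).items():
    print(key, score)
# ('old', 'thin') 0.5
# ('old', 'fat') 1.0
# ('new', 'thin') 1.0
# ('new', 'fat') 1.0
```

In this toy matrix the fat harness doubles the score on the old model and adds nothing on the new one. That is the shape to watch for in your own results: it suggests the model has absorbed a behavior the harness used to supply.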

Lee’s benchmarks only exist because someone built an eval harness. Letta’s claim that they beat Claude Code on Opus 4.5 only carries weight because both ran the same memory evaluation suite. Without the eval harness, you cannot tell whether the new model regressed, the new harness regressed, or your workload changed. You are making decisions in the dark.

What This Costs You If You Get It Wrong

Two failure modes recur in teams that treat the harness as a platform.

The first is the team that builds a heavy harness and tightly couples production policy to it. When the next model ships and the harness needs a rewrite, the policy gets rewritten too. The audit trail breaks. The governance team loses six months relearning what was true. Compliance officers stop trusting the platform because the platform keeps changing underneath them.

The second is the team that under-invests in the eval harness because it does not feel like product work. The first time a new model degrades a workflow that used to work, this team has no instrumented way to prove it. They argue from anecdotes. They roll back, or worse, they roll forward and discover the regression in production. Zechner’s GPTs are the consumer-facing version of this story. The enterprise version is more expensive.

Both failure modes have the same root cause: confusing the harness with the platform. The harness is the part that gets eaten. The platform is what remains when the harness is gone.

Do This Now

Spend one meeting this month with your platform team and answer three questions in writing.

First, which parts of your current harness encode governance policy versus which parts encode model-specific scaffolding? If you cannot draw the line on a whiteboard in 10 minutes, the two are tangled. Untangle them before the next model ships.

Second, do you have an eval harness that runs the same task suite across at least two model versions? If not, you cannot detect the absorption pattern when it hits your stack. Build the eval harness this quarter. Treat it as platform work, not product work.

Third, what is your migration plan when the model you depend on releases a successor that does not need your current harness scaffolding? “We will figure it out” is not a plan. “We will rewrite the training harness, keep the production harness, and rerun the eval suite to confirm parity” is a plan.
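The “confirm parity” step of that plan can be reduced to a small check over eval scores. This is a sketch under stated assumptions: scores are per-task fractions between 0 and 1, the 0.02 tolerance is arbitrary, and the task names and numbers are made up for illustration rather than drawn from any benchmark.

```python
from typing import Dict, List


def parity_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Tasks where the candidate is missing or scores more than `tolerance` below baseline."""
    regressions = []
    for task, base_score in baseline.items():
        cand_score = candidate.get(task)
        if cand_score is None or cand_score < base_score - tolerance:
            regressions.append(task)
    return sorted(regressions)


# Hypothetical scores: old model with old harness vs. new model with rewritten harness.
baseline = {"refactor": 0.80, "triage": 0.65, "summarize": 0.90}
candidate = {"refactor": 0.84, "triage": 0.55, "summarize": 0.89}

print(parity_regressions(baseline, candidate))  # ['triage']
```

An empty list means the migration holds parity within tolerance. A non-empty list names the workflows to investigate before the new setup reaches production.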

The teams that win the next two years of agent operations are not the ones with the most sophisticated harnesses. They are the ones whose harnesses are small enough to throw away.


This analysis synthesizes The Cost of Overfitting the Harness (dbreunig.com, May 2026), Hidden Technical Debt of AI Systems: Agent Harness (Han Lee personal blog, May 2026), and Letta Code benchmarks (Letta, May 2026).

Victorino Group helps engineering organizations split production, training, and eval harnesses so model generations can rotate without breaking governance.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
