When the Harness Changes and the Model Does Not
Between February and April 2026, the internet convinced itself Anthropic had quietly nerfed Claude. The forensic evidence says otherwise. The product changed. The weights did not.
The best public artifact is GitHub issue #42796, filed by AMD Senior Director Stella Laurenzo on April 2. She analyzed 6,852 Claude Code sessions, 17,871 thinking blocks, and 234,760 tool calls. Her headline number: the read-to-edit ratio collapsed from 6.6 to 2.0 between late January and early April. Claude Code was still Claude Code. It just stopped reading files before mutating them, and it started rewriting entire files where it used to surgically edit. User interrupts per thousand tool calls jumped roughly twelvefold.
Marcus Schuler at Implicator.ai published the adult read of this on April 15: there is no evidence the model weights shifted. The deterioration was produced by the harness. Effort defaults, adaptive thinking, cache duration, context compaction, quota policy, and status incidents are all vendor-controlled knobs that move without a model-version bump. Claude Code lead Boris Cherny confirmed the mechanism in public replies. Adaptive thinking, introduced February 9, sometimes allocated zero reasoning tokens, which is how you get what Cherny called precise hallucinations: fake commit SHAs, non-existent packages, made-up API versions. The default effort dropped from high to medium on March 3. The redact-thinking beta header rolled from 0% to 100% of sessions between March 5 and March 12.
This is not a story about whether Anthropic lied. They did not. The documentation exists. Cherny’s replies are substantive and technical. If anything, Anthropic is the current transparency ceiling in this market, better than OpenAI or Google on the equivalent primitives. The story is uglier than dishonesty. The product a customer bought on day one and the product they are running on day ninety are no longer the same artifact, and the customer has no contractual or technical instrument to detect the difference.
That is a buyer-side governance problem.
The Contract Surface Is Missing
Every governance conversation I have had with enterprise buyers over the last eighteen months framed the vendor as the accountable party. Can we trust the model? Can we audit the training data? Can we get an indemnity clause? These questions are not wrong. They are insufficient.
In The Harness Difference, we showed that the same Claude Opus model scored 42% and 78% on CORE-Bench depending on the scaffolding wrapping it. Not a rounding error. A 36-point swing with zero model changes. In Claude Managed Agents: Harness Governance as a Vendor Product, we argued that when the vendor owns the harness, the vendor owns the behavior. Implicator’s April 15 piece is the empirical proof.
Three properties make the buyer’s position acute:
The model name is marketing, not a contract artifact. “Claude Opus 4.6” on April 1 and the same string on April 10 produce measurably different behavior because effort defaults, thinking budget policy, compaction thresholds, and cache TTL all move server-side.
Telemetry is asymmetric. The vendor sees every token, cache hit, and thinking block. The customer sees outcomes plus a 200 OK. When a third-party benchmark claimed an 83.3% to 68.3% accuracy collapse, there was no standardized primitive for the customer to reproduce the measurement at the same effort level. (Schuler himself flags that benchmark as methodologically broken. The point is the missing instrument, not the number.)
Transparency is treated as cost, not commitment. Anthropic’s September 2025 postmortem is the best disclosure artifact any AI vendor has produced: three infrastructure bugs, up to 16% of Sonnet 4 requests degraded at peak. Yet even Anthropic’s own engineers admitted that internal privacy controls prevented them from reading the problematic interactions. If the vendor’s own debuggability is gated, the customer’s is structurally worse.
Production AI procurement without a harness-change SLA is the 2026 equivalent of SaaS procurement without an uptime SLA in 2012.
Ten Harness Primitives That Change the Product
These are the surfaces where product behavior shifts without model changes. Every one is vendor-controlled today. Every one should be governed tomorrow. A sketch of what pinning them as a single configuration record could look like follows below.
- Effort default. Low, medium, high, max. Dropped from high to medium on March 3, then raised back to high for enterprise tiers on April 7.
- Adaptive thinking. Model self-allocates reasoning budget per turn. Sometimes allocates zero.
- Cache TTL. Five-minute default versus one-hour premium. Cache misses are treated as production incidents internally.
- Context compaction. Automatic mid-session summarization. No customer tool to inspect what got compacted away.
- Thinking redaction. The redact-thinking-2026-02-12 header. UI-only per the vendor, but customers cannot independently verify.
- Quota policy. Rate limits, message caps, per-tier session budgets. Surfaces as errors with ambiguous root cause.
- Status-page incident thresholds. The August 2025 routing bug affected 16% of requests at peak while the status page showed operational.
- Tool defer-loading and plan-mode gating. Architectural decisions that shape which behaviors are cache-cheap and which are expensive.
- Prompt caching prefix preservation. Every product decision bends to avoid cache invalidation. Invisible to the buyer.
- System prompt drift. Server-rendered prompts and tool definitions change without notice, unless a leak surfaces them.
For the mechanical foundation, see Harness Engineering Is Not New and What Is an Agent Harness. What is new is not the engineering. It is the realization that these ten surfaces form the product.
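What would it take to make those ten surfaces contract-referenceable? A minimal sketch, not a vendor API: one way a buyer could force them into a single, versioned artifact, with every field name and example value an assumption standing in for whatever the vendor actually discloses.

```python
# A minimal sketch, not a vendor API: one way a buyer could pin the harness
# surface as an explicit, versioned record (the "Schedule A" referenced in the
# procurement language below). All field names and example values are
# illustrative assumptions.
from dataclasses import dataclass
from hashlib import sha256

@dataclass(frozen=True)
class HarnessConfig:
    model_name: str                   # the marketing string, necessary but not sufficient
    effort_default: str               # "low" | "medium" | "high" | "max"
    adaptive_thinking: bool           # can allocate zero reasoning tokens when enabled
    min_thinking_tokens: int          # contractual floor to rule out zero-token turns
    cache_ttl_seconds: int            # e.g. 300 default vs 3600 premium
    compaction_threshold_tokens: int  # when mid-session summarization kicks in
    thinking_redaction: bool          # whether the redaction header is applied
    system_prompt_sha256: str         # hash of the server-rendered system prompt
    tool_defs_sha256: str             # hash of the tool definitions served to the model
    quota_policy_id: str              # identifier for the rate-limit / session-budget tier

    def fingerprint(self) -> str:
        """Stable hash of the whole configuration; any change to a named knob shows up here."""
        return sha256(repr(self).encode()).hexdigest()

# The configuration the buyer believes they bought on day one.
schedule_a = HarnessConfig(
    model_name="claude-opus-4-6",          # illustrative identifier
    effort_default="high",
    adaptive_thinking=False,
    min_thinking_tokens=1024,
    cache_ttl_seconds=3600,
    compaction_threshold_tokens=150_000,
    thinking_redaction=False,
    system_prompt_sha256="<disclosed by vendor>",
    tool_defs_sha256="<disclosed by vendor>",
    quota_policy_id="enterprise-tier",
)
print(schedule_a.fingerprint())  # record at signing; re-check on every vendor notice
```

The fingerprint is not cryptographic rigor. The point is that any server-side change to a named knob becomes a diffable, dateable event instead of a vibe on a forum thread.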
The Five-Surface Audit Framework
A buyer who cannot see weights, logs, or server config can still govern the output surface. Imperfectly, but not at zero. Here is what governed procurement should require:
1. Behavioral Canary Suite (customer-owned). A fixed set of deterministic task prompts, not benchmarks, run daily against the vendor endpoint. Measure tool-call count, read-to-edit ratio, thinking-token burn, outcome correctness. Laurenzo's methodology, productized. This is the only signal the customer fully owns. A minimal sketch follows this list.
2. Harness-Change SLA. Contractual commitment from the vendor to notify customers before changing effort defaults, adaptive-thinking policy, cache TTL, compaction thresholds, or system prompts. Modeled on cloud-provider maintenance-window SLAs. Without this clause, the model name is not a stable product identifier.
3. Effort-Parity Clause. Contractual right to pin effort level, disable adaptive thinking, and lock thinking-budget policy. Anthropic technically already offers this (CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1, /effort max). Defaults, not capabilities, are what customers actually get. Procurement should force the non-default to become the contract default for regulated workloads.
4. Incident Disclosure Floor. Automatic customer disclosure when vendor-internal incidents degrade more than X% of requests for more than Y minutes, including sampling ID ranges. Anthropic’s September 2025 postmortem is the template. It needs to become the floor, not the ceiling.
5. Audit-Log Right. Customer right to retrieve, per-request, the actual effort level served, thinking tokens consumed, cache-hit status, and system-prompt hash. Today this is a vendor-internal telemetry asset. In governed procurement, it becomes a buyer artifact. A sketch of what that record could contain follows below.
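Here is what the canary suite from surface 1 could reduce to. A minimal sketch, assuming the buyer already has a client that replays a fixed task against the live endpoint and turns the transcript into the result record below; the field names and Schedule B threshold values are illustrative, and the metrics follow Laurenzo's read-to-edit and thinking-token methodology.

```python
# A minimal sketch of the customer-owned canary suite. Producing CanaryResult
# objects from real transcripts is left to the buyer's own client; field names
# and Schedule B threshold values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CanaryResult:
    task_id: str
    tool_calls: int
    read_calls: int        # file or content reads before any mutation
    edit_calls: int        # surgical edits plus whole-file rewrites
    thinking_tokens: int   # reasoning budget actually consumed
    outcome_correct: bool  # deterministic check against a known-good answer

# Schedule B: behavioral thresholds written into the contract (example values).
SCHEDULE_B = {
    "min_read_to_edit_ratio": 4.0,  # Laurenzo observed a collapse from 6.6 to 2.0
    "min_thinking_tokens": 256,     # guards against zero-token adaptive turns
    "min_pass_rate": 0.95,
}

def evaluate(results: list[CanaryResult]) -> dict[str, bool]:
    """Compare one day's canary run against the contractual thresholds."""
    reads = sum(r.read_calls for r in results)
    edits = sum(r.edit_calls for r in results)
    ratio = reads / edits if edits else float("inf")
    pass_rate = sum(r.outcome_correct for r in results) / len(results)
    floor = min(r.thinking_tokens for r in results)
    return {
        "read_to_edit_ratio_ok": ratio >= SCHEDULE_B["min_read_to_edit_ratio"],
        "thinking_floor_ok": floor >= SCHEDULE_B["min_thinking_tokens"],
        "pass_rate_ok": pass_rate >= SCHEDULE_B["min_pass_rate"],
    }

# Run daily; any False is a breach candidate under the harness-change SLA.
```

The design point is ownership: this runs on the buyer's schedule, under the buyer's account, against the live endpoint, so the signal cannot be tuned away server-side without showing up in the numbers.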
The governance-engineering reflex will want to automate this. We wrote about that in When the Harness Engineers Itself. The audit framework is what the meta-harness regulates. Start with the contract.
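Part of what that contract should name is the per-request record from the audit-log right, surface 5. A minimal sketch of what the buyer-side artifact could contain; every field name is an assumption, since today this data exists only as vendor-internal telemetry.

```python
# A minimal sketch of the per-request record the audit-log right would entitle
# the buyer to retrieve. Every field name is an assumption; today this data
# exists only inside the vendor's own telemetry.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class RequestAuditRecord:
    request_id: str
    timestamp: datetime
    model_name: str            # the marketing string actually served
    effort_served: str         # effort level the harness actually applied to this request
    thinking_tokens: int       # reasoning tokens consumed (zero is a red flag)
    cache_hit: bool            # whether the prompt prefix came from cache
    compaction_applied: bool   # whether mid-session summarization touched this turn
    system_prompt_sha256: str  # hash of the system prompt rendered for this request
    harness_fingerprint: str   # should match the Schedule A fingerprint above
```

Joined against the canary results above, records like this are what turn "the product feels worse" into a breach claim with request-ID ranges attached.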
What Procurement Language Should Actually Say
A buyer who closes a deal today can still take action tomorrow. Redline your next AI vendor contract with four sentence patterns.
Stop writing “Vendor shall provide access to Model X.” Start writing “Vendor shall provide access to Model X operated under the harness configuration documented in Schedule A, with thirty days written notice of any change to effort defaults, thinking-budget policy, cache TTL, context-compaction thresholds, or system-prompt content.”
Stop writing “Vendor will maintain reasonable performance.” Start writing “Vendor will maintain the behavioral canary thresholds defined in Schedule B, measured daily, with breach remediation within forty-eight hours.”
Stop writing “Vendor represents the Service is suitable for production use.” Start writing “Vendor will disclose within one business day any internal incident degrading more than three percent of Customer requests for more than thirty minutes, including affected request-ID ranges.”
Stop writing “Customer acknowledges Service may be updated from time to time.” Start writing “Customer retains the right to pin effort level, disable adaptive reasoning, and lock thinking-budget policy for all production traffic, at the pricing tier specified in Schedule C.”
The contract language exists. Mayer Brown's February 2026 analysis on agentic-AI contracting names the gap and proposes audit rights, outcome-based SLAs, and decision-log obligations. The clauses are writable. The vendors are not yet offering them. They will, once buyers stop signing the current boilerplate.
The Adult Take
Implicator’s piece headlines “too dark.” The headline overreaches. Anthropic is not hiding things. They document adaptive thinking, effort levels, caching, and compaction better than any peer. They publish postmortems. They patch and communicate. The body of Schuler’s article admits as much. The honest framing is harder to sell: the ceiling of vendor transparency in April 2026 is still below the floor a regulated enterprise needs.
This is not a problem you solve by shaming one vendor. You solve it by shifting the contract surface across the category. As we argued in The AI Trust Gap Is Not Closing, trust gets harder as capability grows. And as we covered in Growth Is Now a Trust Problem, customer trust collapses the moment the product identity becomes unstable. Laurenzo’s 6,852 sessions are what instability looks like in telemetry.
The harness is the product. The contract should name it. The audit surface should measure it. Anything less is procurement theater.
This analysis synthesizes Claude Probably Wasn’t Secretly Nerfed — Anthropic Made the Black Box Too Dark (April 2026), GitHub issue #42796 by Stella Laurenzo (April 2026), Anthropic’s Postmortem of Three Recent Issues (September 2025), Anthropic’s Adaptive Thinking documentation, and Mayer Brown’s Contracting for Agentic AI Solutions (February 2026).
Victorino Group helps enterprise buyers design harness-audit contracts for AI products that are no longer just models. Let’s talk.