Capability Is a Commodity. Scaffolding Is the Moat.

TV
Thiago Victorino
9 min read
Capability Is a Commodity. Scaffolding Is the Moat.
Listen to this article

For two years, procurement meetings about AI have revolved around a single question: which model should we buy? Claude or GPT? Open-weights or frontier? Every vendor deck, every RFP, every board slide assumes the answer lives inside the model.

In the first week of April 2026, three independent research teams published papers that have nothing to do with each other. A university lab in Berkeley broke benchmarks. A security startup found bugs in OpenBSD. An Arxiv group audited LLM routers. Different authors, different agendas, different domains.

Read them on the same afternoon and the message is impossible to miss. Model capability is a commodity. The moat is everything around the model.

The Berkeley Paper: Benchmarks as Theater

Hao Wang and colleagues at UC Berkeley’s Center for Responsible Decentralized Intelligence did something that, in hindsight, someone had to do. They took the eight most respected agent benchmarks in the industry and tried to cheat on them. Not to improve scores. To see if cheating was possible.

It was. At rates that embarrass the field.

Terminal-Bench: 100% exploit rate. SWE-bench Verified: 100%. SWE-bench Pro: 100%. WebArena: roughly 100%. FieldWorkArena: 100%. CAR-bench: 100%. GAIA: 98%. OSWorld: 73%.

And the word the authors use for their results is the word that matters: zero legitimate task solutions. Their exploits did not involve solving the problems well. They involved not solving the problems at all. Reading the answer out of the evaluator’s memory. Shipping the test suite with the solution. Calling eval() on strings the benchmark expected to sandbox. Injecting prompts into LLM judges that had no input sanitization.

The paper catalogs what the authors call the “seven deadly patterns.” Shared environments between the agent and its evaluator. Test data visible to the code being tested. Unsafe evaluation functions. Answers stored adjacent to prompts. LLM judges with no input hardening. It reads like an OWASP Top Ten for a field that never wrote an OWASP Top Ten because nobody thought to.

This is not a hypothetical problem. The paper documents IQuest-Coder-V1 inflating its benchmark scores by 5.2% through git log exploitation. In the real world. On a model people paid attention to.

Here is the thing to sit with. When a company decides to buy “the best coding agent,” they are looking at leaderboards. Those leaderboards are ranking systems whose upper bound was shown, in April 2026, to be almost entirely forgeable. The question “which model is best” is downstream of a measurement apparatus that does not work.

The AISLE Paper: Orchestration Beats Parameters

On a different continent, a small security firm called AISLE published a blog post with a title that sounds like marketing and a methodology that does not. “AI Cybersecurity After Mythos: The Jagged Frontier.”

AISLE uses AI to find vulnerabilities in open source code. Since mid-2025, they have reported more than 180 externally validated CVEs. The current state of the art, at the frontier lab level, is Anthropic’s Claude running what the industry calls the Mythos protocol. AISLE’s headline claim: they can match Mythos on bug discovery using a 3.6 billion parameter open-weights model, at $0.11 per million tokens.

The details are where the argument lives. AISLE recovered the full exploit chain for a 27-year-old OpenBSD bug using a 5.1 billion parameter open-weights model. Not a frontier model. Not Claude. Not GPT. An open model two orders of magnitude smaller than what most buyers assume is required for security work.

The metaphor AISLE uses is worth stealing. “A thousand adequate detectives searching everywhere will find more bugs than one brilliant detective.” The brilliant detective is the model. The thousand adequate detectives are the scaffolding, the orchestration layer, the evaluator design, the search policy, the retry logic, the context management, the decomposition of the problem into pieces a small model can actually solve.

What AISLE is demonstrating, with numbers, is that the interesting engineering does not happen inside the model weights. It happens outside them. And when the outside is done well, the inside becomes interchangeable.

The Arxiv Paper: The Plumbing Is Broken

The third paper is the one nobody was looking for. A team on Arxiv studied 428 LLM API routers — the middleware layer that sits between an application and whichever foundation model it talks to. Routers are plumbing. Routers are boring. Routers are where you put rate limiting and load balancing and the occasional fallback model.

Here is what the paper found.

One paid router and eight free routers were actively injecting code into the traffic they proxied. Seventeen routers were accessing AWS credentials they had no business touching. One router was draining cryptocurrency wallets. Not theoretically. In production. On real customer traffic.

And here is the sentence from the paper that should stop every CTO in their tracks: zero of the studied providers enforced cryptographic integrity between the client and the upstream model.

Zero. Not “most failed.” Not “many had gaps.” Zero of them could prove that the bytes the application sent were the bytes the model received, or that the bytes the model returned were the bytes the application got back. The supply chain of AI is running on the honor system.

This matters because of a pattern the industry has been quietly normalizing. “We use Claude via a router” is a sentence that, until this paper, sounded like a minor implementation detail. After this paper, it is a sentence that describes a completely unaudited trust boundary. The model you think you are paying for is not necessarily the model you are getting. The output you think the model produced is not necessarily the output the model produced.

The Thesis

Now put the three papers next to each other.

Benchmarks that rank models can be forged at near-100% rates. So the evidence buyers use to choose models is, in many cases, fiction.

A 3.6 billion parameter model can match a frontier model at finding real vulnerabilities, given the right scaffolding. So the evidence that bigger models are necessary is, in many cases, also fiction.

The router layer that connects applications to models has no integrity guarantees, and some of it is openly malicious. So the evidence that “using Claude” means your application talks to Claude is, in many cases, also fiction.

If you believe any one of these papers, the “which model” question gets weaker. If you believe all three, the question stops making sense.

The value in AI systems is migrating, visibly, to the layer around the model. The verification harness that decides whether output is real. The evaluator that catches its own exploits. The orchestration that makes a small model find things a big model misses. The router integrity that guarantees the bytes you received are the bytes the model sent. The supply chain assurance that nobody is mining crypto on your credentials.

These are not exotic pieces of infrastructure. They are the boring parts. They are the parts that do not appear in product demos. They are also, as of April 2026, the parts that separate AI systems that work from AI systems that only appear to work.

What to Buy

We have written before about verification debt, about the verification revolution, about how the verification stack is consolidating, and about the security architecture gap in agent systems. Those pieces argued, from different angles, that the governance layer was becoming the product.

This week’s three papers collapse the argument into one sentence. Capability is a commodity. Scaffolding is the moat.

The buyers who continue to evaluate AI vendors by model name are running RFPs against a fiction. The buyers who evaluate by harness design, evaluator robustness, orchestration quality, router integrity, and supply chain assurance are evaluating the things that actually determine whether the system works.

The Waymo safety verification gap exists in autonomous driving because the industry measures the wrong thing. The Berkeley paper just demonstrated that the AI agent industry measures the wrong thing too. The AISLE paper demonstrated that what you do with a mediocre model matters more than which great model you pick. The Arxiv paper demonstrated that the plumbing below all of this has no integrity at all.

Three papers, one week, one message. Stop asking which model. Start asking which harness.


This analysis synthesizes Hao Wang et al., “How We Broke Top AI Agent Benchmarks” (UC Berkeley RDI, April 2026); AISLE, “AI Cybersecurity After Mythos: The Jagged Frontier” (April 2026); and “LLM Supply Chain” (Arxiv, April 2026).

Victorino Group helps buyers evaluate AI systems by the scaffolding, not the model. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com . About The Thinking Wire →

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation