The AI Control Problem

Half Your Benchmarks Are Wrong: What Happens When AI Measures Itself

Thiago Victorino
9 min read

Three things happened in March 2026 that, separately, look like routine AI news. Together, they describe a structural collapse in how enterprises evaluate AI systems.

METR published a study showing that benchmark scores overstate real-world AI performance by 24 percentage points. Anthropic demonstrated that infrastructure configuration alone swings benchmark scores by 6 points. And OpenAI acquired PromptFoo, the most widely used independent AI testing tool.

Each finding has nuance. The convergence does not. The verification stack that enterprises rely on to make AI procurement decisions is consolidating under the control of the organizations being evaluated.

The 24-Point Overstatement

METR, the nonprofit formerly known as ARC Evals, ran an experiment that nobody else had bothered to run. They took 296 AI-generated pull requests and sent them to four active maintainers of three open-source repositories (scikit-learn, Sphinx, pytest). These maintainers reviewed the PRs the way they review any contribution: against their project’s actual standards.

The results were stark. Maintainer merge rates averaged 24.2 percentage points lower than what automated benchmark graders reported (standard error: 2.7). The automated grader said the code passed. The humans who maintain the codebase said it did not.
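A quick sanity check on that gap, using the reported point estimate and standard error. The normal approximation is my assumption, not METR's stated method, but it shows why the finding is hard to explain away:

```python
# Back-of-envelope 95% confidence interval around the reported gap.
# Point estimate and standard error come from the study; the
# normal approximation (z = 1.96) is an assumption.
point_estimate = 24.2  # percentage points
standard_error = 2.7
z = 1.96

lo = point_estimate - z * standard_error
hi = point_estimate + z * standard_error
print(f"95% CI: [{lo:.1f}, {hi:.1f}] percentage points")
```

Even the bottom of that interval is a gap large enough to change a procurement decision.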

Why? Code quality was the top rejection category. Style violations. Non-compliance with repository conventions. The kind of failures that a pattern-matching grader cannot detect because they require understanding what “good” means in a specific project context.

For perspective: the human-written patches in the same dataset had a 68% merge rate from the same maintainers. The benchmark wasn’t just optimistic. It was measuring something different from what maintainers care about.

Claude Sonnet 4.5 showed roughly a 7x overstatement by the automated grader when comparing 50-minute and 8-minute evaluation horizons. METR’s own language is careful: benchmark scores “may lead one to overestimate how useful agents are” without iteration or human feedback.

The coverage is limited: 95 of 500 issues (19%) across 3 of 12 repositories (25%) in SWE-bench Verified. METR acknowledges this. But the direction of the finding is unambiguous, and the magnitude is large enough that even significant methodological corrections would not eliminate it.

We covered the contamination problems with SWE-bench when OpenAI retired its own benchmark in February. That article focused on training data leakage. METR’s study reveals a different, complementary problem: even when the benchmark is not contaminated, automated grading inflates scores because it cannot evaluate code the way human maintainers do. Contamination and grading inflation are independent failure modes. Both are operating simultaneously.

The 6-Point Infrastructure Noise

While METR was demonstrating that scores overstate capability, Anthropic was quantifying a different source of unreliability: the infrastructure running the benchmark.

On Terminal-Bench 2.0, the difference between the most and least resourced evaluation setups was 6 percentage points (p < 0.01). Same model. Same benchmark. Same questions. Six points of variance from server configuration alone.

Infrastructure error rates tell the story. Under strict resource enforcement, 5.8% of evaluation runs failed due to infrastructure problems. Give the model 3x the default resources and that drops to 2.1%. Remove resource limits entirely and it falls to 0.5%.

Anthropic identified a critical threshold at 3x resources. Below that line, additional resources fix reliability problems. Above it, resources enable entirely new problem-solving strategies. A model that cannot hold its full context in memory will approach a problem differently than one that can. The benchmark score reflects which strategy the infrastructure permitted, not which strategy the model is capable of.

The implication for leaderboard comparisons is direct. Anthropic’s own recommendation: “Leaderboard differences below 3 percentage points deserve skepticism until eval configuration is documented.” Most leaderboard differences are below 3 percentage points. Most eval configurations are not documented.
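Anthropic's threshold is simple enough to operationalize directly. A minimal sketch, with hypothetical model scores; the function name and the rule that documentation rescues a sub-threshold gap are my framing:

```python
# Sketch: flag leaderboard gaps that fall within plausible
# infrastructure noise. The 3-point threshold is Anthropic's
# recommendation quoted above; scores below are hypothetical.
INFRA_NOISE_THRESHOLD = 3.0  # percentage points

def gap_is_meaningful(score_a: float, score_b: float,
                      config_documented: bool = False) -> bool:
    """A gap below the noise threshold only counts if the
    evaluation configuration behind both scores is documented."""
    gap = abs(score_a - score_b)
    return gap >= INFRA_NOISE_THRESHOLD or config_documented

print(gap_is_meaningful(74.1, 72.3))        # 1.8-pt gap, undocumented config
print(gap_is_meaningful(74.1, 72.3, True))  # same gap, documented config
print(gap_is_meaningful(78.0, 72.3))        # 5.7-pt gap
```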

This is fixable. Anthropic published specific recommendations for standardizing infrastructure. But “fixable” and “fixed” are different words. Until evaluation infrastructure is standardized and disclosed, benchmark comparisons between models are comparing configurations as much as capabilities.

The Testing Tool That Changed Owners

PromptFoo was the closest thing the industry had to an independent AI evaluation layer. Over 350,000 developers used it. More than 130,000 were active monthly. A quarter of the Fortune 500 depended on it in production environments.

In March, OpenAI acquired it.

The open-source project will continue. OpenAI has said so explicitly. But governance changed hands. The tool that enterprises used to evaluate OpenAI’s models is now owned by OpenAI. The tool that compared GPT against Claude against Gemini now reports to one of the contestants.

This is not unprecedented. Industries routinely see independent measurement capabilities absorbed by the entities being measured. What makes this case notable is the timing. It happened in the same month that independent research demonstrated benchmark scores are unreliable, and that infrastructure noise makes model comparisons suspect.

The a16z investor involved in the deal, Zane Lackey, described PromptFoo as helping “organizations find and fix AI risks before they ship.” That value proposition does not disappear because the tool changed owners. But the incentive structure around it does.

A testing tool owned by a lab has different priorities than a testing tool accountable only to its users. The difference may be subtle. In measurement, subtle matters.

The Consolidation Pattern

Each of these developments has caveats.

METR’s coverage is partial; its authors say so. Anthropic’s infrastructure noise is quantified and addressable. PromptFoo’s open-source license does not evaporate because of an acquisition.

But the pattern is what matters for enterprise decision-making.

The benchmark scores that procurement teams rely on overstate performance by 24 points. The infrastructure running those benchmarks introduces 6 points of undisclosed variance. And the primary independent tool for verifying AI behavior just moved inside a lab. These are not three problems. They are one problem with three faces: the verification infrastructure that enterprises need to make informed decisions is either unreliable, undisclosed, or no longer independent.

As we explored in AI Verification Debt, the structural deficit is not that organizations lack access to AI capabilities. It is that they lack the infrastructure to verify what those capabilities actually deliver. That argument was about code quality and trust. The March data extends it to the evaluation layer itself.

What Enterprises Should Do

The organizations that will handle this well share a common trait: they stopped trusting external scores and started building internal measurement.

Build your own evaluation suite. Not a benchmark. An evaluation process that tests AI output against your codebase, your standards, your definition of “good enough.” METR proved that generic automated grading misses what domain-specific human review catches. Your maintainers are your best graders. Use them systematically.
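What a starting skeleton might look like. Everything here is a stand-in: the check functions are trivial placeholders for your project's real standards (style rules, repository conventions, your actual test suite):

```python
# Sketch: skeleton of an internal evaluation suite. The checks
# are hypothetical stand-ins for project-specific standards.
def follows_conventions(patch: str) -> bool:
    # Stand-in: in practice, enforce your repo's style and
    # convention rules here.
    return "TODO" not in patch

def passes_tests(patch: str) -> bool:
    # Stand-in: in practice, apply the patch and run your
    # actual test suite.
    return "fix" in patch

CHECKS = {"conventions": follows_conventions, "tests": passes_tests}

def evaluate(patch: str) -> dict[str, bool]:
    """Run every project-specific check against one AI-generated patch."""
    return {name: check(patch) for name, check in CHECKS.items()}

print(evaluate("fix: handle empty input"))
print(evaluate("TODO refactor later"))
```

The point is not the checks themselves but the structure: your standards, encoded once, applied to every AI contribution before a human ever reads it.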

Demand infrastructure disclosure. When a vendor quotes a benchmark score, ask what infrastructure ran it. Memory allocation, timeout settings, resource limits. If they cannot answer, the score is meaningless. Anthropic’s research gives you the specific questions to ask.

Diversify your testing tools. If your AI evaluation pipeline depends on a single tool, you have a single point of failure. That was true before PromptFoo’s acquisition. It is more urgent after it. Run multiple evaluation frameworks. Keep at least one that is not owned by a model provider.

Measure operational outcomes, not synthetic scores. As we argued when METR’s productivity experiments hit structural limits, the question “how good is this AI model?” is becoming unanswerable through benchmarks. The question “what does this AI model produce in our environment?” is answerable through operational data. PR acceptance rates. Post-merge defect rates. Maintainer review time. These numbers are harder to collect and impossible to game.
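Those metrics are straightforward to compute once you log them. A minimal sketch over hypothetical PR records; the field names are my invention, not a standard format:

```python
# Sketch: computing the operational metrics named above from
# your own PR history. Records and field names are hypothetical.
from statistics import median

prs = [
    {"merged": True,  "post_merge_defects": 0, "review_hours": 2.5},
    {"merged": True,  "post_merge_defects": 1, "review_hours": 6.0},
    {"merged": False, "post_merge_defects": 0, "review_hours": 1.0},
    {"merged": True,  "post_merge_defects": 0, "review_hours": 3.5},
]

acceptance_rate = sum(p["merged"] for p in prs) / len(prs)
merged = [p for p in prs if p["merged"]]
defect_rate = sum(p["post_merge_defects"] > 0 for p in merged) / len(merged)
median_review = median(p["review_hours"] for p in prs)

print(f"acceptance rate:  {acceptance_rate:.0%}")
print(f"defect rate:      {defect_rate:.0%}")
print(f"median review:    {median_review}h")
```

None of this requires new tooling; it requires deciding that these numbers, not a vendor's leaderboard position, are your measure of the tool.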

The Uncomfortable Arithmetic

Here is the math that should concern every CTO making an AI procurement decision in 2026.

Published benchmark score: 75%. Subtract 24 points for automated grading inflation (METR). Subtract up to 6 points for infrastructure variance (Anthropic). Effective range of real-world performance: somewhere between 45% and 75%.
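Made explicit, with the caveat that stacking the two figures as a simple subtraction is a back-of-envelope bound, not a calibrated model:

```python
# Sketch: the arithmetic above. The inflation figures are the
# METR and Anthropic numbers cited in this article; treating
# them as additive worst-case corrections is an assumption.
published = 75.0
grading_inflation = 24.0  # METR: automated-grader overstatement
infra_variance = 6.0      # Anthropic: configuration swing

lower = published - grading_inflation - infra_variance
upper = published  # if neither correction applies to your workload

print(f"effective range: {lower:.0f}%-{upper:.0f}%")
```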

That is not a margin of error. That is a range wide enough to make the original number meaningless.

The labs producing these models are doing genuine, often impressive engineering work. The capabilities are real. But the measurement systems that enterprises use to compare, select, and justify these tools are broken in ways that have now been independently quantified.

You cannot fix this by finding a better benchmark. You fix it by building your own verification infrastructure. The question is whether your organization considers that a priority or a nuisance.

Every month you delay, you accumulate more verification debt on tools whose actual performance you have never independently measured. That debt compounds. It always does.


This analysis synthesizes METR’s SWE-bench maintainer study (March 2026), Anthropic’s infrastructure noise research (March 2026), and PromptFoo’s OpenAI acquisition announcement (March 2026).

Victorino Group helps enterprises build verification infrastructure that doesn’t depend on the labs being verified. Let’s talk.
