The Noise Is the Signal: METR Data Shows AI Measurement Getting Harder, Not Easier
In February, we wrote about METR’s dependency problem: the organization could no longer maintain clean control groups because developers refused to work without AI. That was a measurement crisis at the experiment level.
New data from METR, reported by Timothy B. Lee in Understanding AI, reveals a deeper crisis. The measurement instruments themselves are breaking down. Benchmarks are saturating. Task-level noise spans ranges so wide that individual results are nearly meaningless. And the cost of establishing human baselines is becoming prohibitive.
The measurement problem is not improving with better methodology. It is getting structurally worse.
The Benchmark Ceiling
MMLU was the gold standard for general AI capability. Here is what happened to it:
- GPT-3 (2020): 43.9%
- GPT-4 (2023): 86.4%
- GPT-4o (2024): 88.7%
- GPT-4.1 (2025): 90.2%
The curve flattened. Not because models stopped improving, but because the test ran out of room. Research from Northhouse et al. (2024) found that approximately 6.5% of MMLU questions contain errors, putting the theoretical ceiling around 93%. Most of the remaining distance between GPT-4 and that ceiling is noise from flawed questions, not signal.
This is the benchmark paradox we identified in our earlier analysis: once a benchmark saturates, it stops measuring capability differences and starts measuring test artifacts. The community responded by creating harder tests. Humanity’s Last Exam (HLE) launched with o3-mini scoring 13.4%. Gemini 3.1 now leads at 44.7%. The pattern is already repeating. These benchmarks will saturate too. They always do.
METR’s Task Length Benchmark: Capability Explosion
METR’s task length benchmark measures something different: how long a task, measured in equivalent human programmer time, can an AI agent handle autonomously? The progression is striking:
- GPT-3.5: ~30 seconds
- GPT-4 (March 2023): ~4 minutes
- o1 (December 2024): ~40 minutes
- GPT-5 (August 2025): ~3 hours
- Claude Opus 4.6 (February 2026): ~12 hours
- Claude Opus 4.6 with CI (February 2026): 5 to 66 hours
The jump from 30 seconds to 12 hours in three years is extraordinary. But look at the last line. Claude Opus 4.6 with computer infrastructure scores somewhere between 5 and 66 hours. That is not a measurement. That is a confession that measurement at this scale does not work yet.
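To get a feel for the trend behind those point estimates, here is a minimal sketch that fits an exponential curve to the dated entries in the list above and reports the implied doubling time. It uses only the point estimates, treats the dates as rough fractional years, and ignores the uncertainty discussed below.

```python
# A minimal sketch: fit an exponential trend to the dated point estimates
# listed above. Point estimates only -- the wide ranges discussed below are
# exactly what this ignores.
import numpy as np

# (approximate fractional year, autonomous task length in hours)
points = [
    (2023.25, 4 / 60),    # GPT-4, March 2023, ~4 minutes
    (2024.92, 40 / 60),   # o1, December 2024, ~40 minutes
    (2025.58, 3.0),       # GPT-5, August 2025, ~3 hours
    (2026.08, 12.0),      # Claude Opus 4.6, February 2026, ~12 hours
]

years = np.array([p[0] for p in points])
log2_hours = np.log2([p[1] for p in points])

doublings_per_year, _ = np.polyfit(years, log2_hours, 1)
print(f"implied doubling time: ~{12 / doublings_per_year:.1f} months")
```

Whatever the exact figure, the slope is steep. The problem, as the next section makes clear, is how wide the error bars around each of those points really are.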
“Extremely Noisy” Is an Understatement
David Rein of METR said it plainly: “When we say the measurement is extremely noisy, we really mean it.”
Joel Becker, also at METR, made it concrete: “If we took one task out or added another, potentially instead of 14.5 hours, we’d measure 8 or 20 hours.”
Consider what that means. A single task addition or removal can swing the measured capability by nearly 2x. The benchmark is not measuring a stable property of the model. It is measuring the interaction between a specific model and a specific set of tasks, and that interaction is dominated by which tasks happen to be in the set.
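A small leave-one-out exercise makes the mechanism concrete. The per-task numbers below are invented for illustration, not METR's data, and the geometric mean is a stand-in for whatever aggregation METR actually uses; the point is only that when per-task horizons are heavy-tailed, dropping a single task moves the aggregate a lot.

```python
# A minimal sketch with invented per-task horizons (in hours), heavy-tailed
# on purpose. How much does the aggregate move if a single task is dropped?
import statistics

task_hours = [0.5, 1, 2, 2, 4, 8, 8, 16, 40, 120]

print(f"all tasks: {statistics.geometric_mean(task_hours):.1f} h")

for i, dropped in enumerate(task_hours):
    rest = task_hours[:i] + task_hours[i + 1:]
    print(f"without the {dropped:g} h task: {statistics.geometric_mean(rest):.1f} h")
```

With these made-up numbers, the estimate ranges from roughly 4 hours to roughly 7.5 hours depending on which single task is dropped, which is exactly the kind of swing Becker describes.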
This is fundamentally different from traditional software benchmarks, where adding one more test case does not double or halve the score. AI capability at the frontier is spiky. Models are brilliant at some tasks and incompetent at closely related ones. A benchmark that captures one spike looks transformative. The same benchmark shifted slightly looks mediocre.
The $8,000 Baseline Problem
Here is the part that makes the noise problem structural rather than solvable. To measure whether an AI can handle a 160-hour task, you need a human baseline. You need a human to actually do the task so you have a comparison point.
At professional rates, that costs over $8,000 per task. And you need multiple human baselines per task to establish variance. And you need dozens of tasks to get statistical significance.
METR is describing a measurement regime that costs hundreds of thousands of dollars per benchmark run. That is not a scalable methodology. As AI capability increases and task length grows, the cost of human baselines grows linearly with it. Eventually, establishing what a human can do becomes more expensive than building the AI system you are trying to measure.
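As a rough check on that claim, here is the arithmetic, using the figures from the text plus two explicitly assumed parameters that the reporting does not pin down: how many human baselines each task needs and how many tasks a run includes.

```python
# Back-of-envelope cost of a human-baselined benchmark run. The 160-hour task
# length and the ~$8,000 per-task figure come from the text; the other two
# numbers are assumptions for illustration only.
task_hours = 160
hourly_rate = 50              # implied by "over $8,000" for a 160-hour task
baselines_per_task = 3        # assumption: multiple humans per task for variance
tasks_per_run = 30            # assumption: "dozens of tasks"

per_task = task_hours * hourly_rate
per_run = per_task * baselines_per_task * tasks_per_run
print(f"one baseline, one task: ${per_task:,}")
print(f"full benchmark run:     ${per_run:,}")
```

Treating every task as a 160-hour task overstates the total, but even generous discounting leaves a run in the hundreds of thousands of dollars, and the figure grows with every increase in task length.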
This is the measurement version of the dependency problem we documented in February. In that analysis, developers could not separate AI-assisted work from non-AI work. Now, the measurement infrastructure cannot scale to match the capability it is trying to measure.
What Saturated Benchmarks Actually Tell You
When a benchmark saturates, organizations face a choice. They can treat the saturated score as evidence that the problem is solved (it is not). They can create a harder benchmark and restart the cycle. Or they can accept that benchmarks were never the right instrument for the decision they are trying to make.
Most AI procurement decisions rely on benchmark comparisons. Model A scores 90.2% on MMLU. Model B scores 88.7%. Model A wins. But if the ceiling is 93% and 6.5% of questions are wrong, the difference between 90.2% and 88.7% is statistically meaningless. You are comparing noise to noise.
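The arithmetic behind that claim is simple enough to show directly, using only the figures already quoted:

```python
# How much of the gap between two near-ceiling scores could flawed questions
# alone account for? All figures are the ones quoted in the text.
error_rate = 0.065            # fraction of MMLU questions found to contain errors
score_a = 0.902               # Model A (GPT-4.1)
score_b = 0.887               # Model B (GPT-4o)

gap = score_a - score_b
print(f"observed gap between models: {gap:.3f}")
print(f"band attributable to flawed questions: {error_rate:.3f}")
# A 1.5-point gap sits well inside the 6.5-point band that flawed questions
# can move a near-ceiling score by, so the ranking is not a reliable signal.
```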
The organizations making the best model selection decisions have already moved past benchmark-driven procurement. They run their own evaluation frameworks against their own tasks, with their own success criteria. That is expensive. It is also the only approach that produces actionable signal.
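What that looks like in practice is less exotic than it sounds. Here is a minimal sketch of such a harness; `run_task` is a placeholder for your own model call and your own pass/fail criterion, not a real API, and the bootstrap interval is there so you can see how much the score depends on task composition.

```python
# A minimal internal evaluation harness: your tasks, your success criterion,
# and a bootstrap interval around the pass rate. `run_task` is a placeholder.
import random

def run_task(task) -> bool:
    """Run the model on one of your own tasks and apply your success criterion."""
    raise NotImplementedError

def evaluate(tasks, n_boot: int = 1000, seed: int = 0):
    results = [run_task(t) for t in tasks]
    point = sum(results) / len(results)

    rng = random.Random(seed)
    boots = sorted(
        sum(rng.choices(results, k=len(results))) / len(results)
        for _ in range(n_boot)
    )
    low, high = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]
    return point, (low, high)   # report the interval, never just the score
```

If the interval is wide, that is the same noise problem METR is describing, now visible on your own tasks rather than hidden behind a vendor's headline number.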
The Governance Implications
Three things follow from METR’s data.
First, vendor benchmark claims are becoming less informative, not more. When the difference between models on saturated benchmarks is smaller than the error rate in the benchmark itself, the numbers are decorative. Treat them accordingly. Any governance framework that relies on benchmark thresholds for procurement or risk decisions needs recalibration.
Second, capability is outrunning measurement. Claude Opus 4.6 can handle tasks equivalent to 12 hours of human programming effort. But the confidence interval on that number might span 4x. Organizations adopting frontier models are deploying capabilities they cannot precisely characterize. That is not necessarily a reason to stop deploying. It is a reason to invest in your own verification stack rather than relying on vendor-reported numbers.
Third, the cost of rigorous measurement is becoming a governance constraint. If establishing a human baseline for one task costs $8,000, most organizations will not do it. They will rely on vendor benchmarks, peer reports, or vibes. The gap between what rigorous measurement costs and what organizations are willing to spend is a governance vulnerability. It means most AI deployment decisions are made with less information than the decision-makers believe they have.
Measurement as Infrastructure
We have been tracking this trajectory since our first METR analysis. The pattern is consistent: each advance in AI capability degrades the instruments used to measure AI capability.
METR is doing honest, transparent work. They are publishing their uncertainty rather than hiding it. That transparency is valuable precisely because it reveals how much the rest of the industry is papering over.
The organizations that navigate this well will treat measurement as infrastructure, not as a one-time procurement exercise. They will build internal evaluation pipelines. They will define success criteria in terms of business outcomes rather than benchmark scores. They will budget for the cost of knowing what their AI systems actually do.
The noise is not going away. It is getting louder. The question is whether your governance framework accounts for it or pretends it does not exist.
This analysis draws on Timothy B. Lee’s reporting in Understanding AI (April 2026), incorporating METR task length benchmark data, MMLU saturation analysis per Northhouse et al. (2024), and Humanity’s Last Exam progression data. See also our earlier analyses: When AI Measurement Breaks: METR’s Dependency Problem, The Benchmark Paradox, and Your LLM Benchmark Score Is a Scaffold Artifact.
Victorino Group helps organizations build measurement infrastructure that produces actionable signal when industry benchmarks no longer do. Let’s talk.