Your LLM Benchmark Score Is a Scaffold Artifact. Here Is What Actually Matters.
MMLU has 14,042 questions. An estimated 6.49% of them are wrong.
That number comes from MMLU-Redux, a systematic audit of the most cited LLM benchmark in existence. In the Virology section, 57% of questions were flagged as having errors. More than half. The benchmark that launched a thousand leaderboard comparisons has been grading models against incorrect answers for years.
This is not news. Researchers have known about MMLU’s quality problems since 2024. What is news is that we now have a rigorous technical framework for understanding why benchmarks fail and what to do about it. Cameron R. Wolfe’s survey of LLM benchmark construction, published in March 2026, provides the anatomy. The findings reshape how organizations should think about evaluation.
We have explored how vendor self-benchmarking produces systematically inflated scores and documented the contamination crisis that forced OpenAI to retire its own coding benchmark. Those pieces identified that benchmarks are broken. This one is about the mechanics of building better ones.
The Scaffold Is the Variable
Epoch AI published a finding that should have ended most benchmark debates immediately: switching the evaluation scaffold (the code that formats prompts, parses responses, and scores answers) causes up to 15% performance swings on SWE-bench Verified. Same model. Same questions. Different wrapper code. Fifteen percent.
To appreciate what this means, consider that the difference between frontier models on most benchmarks is 2 to 5 percentage points. Organizations making procurement decisions based on leaderboard rankings are comparing signal that is smaller than the noise introduced by implementation choices.
Practitioners have suspected this for a while: governed implementation matters more than model selection. The way you deploy, prompt, and integrate a model contributes more variance than the model itself. A 15% scaffold swing dwarfs the difference between GPT-5 and Claude Opus on any given benchmark.
Epoch AI also found that API provider instability adds additional variance. The same model accessed through different API endpoints on different days can produce different scores. The benchmark is not measuring the model. It is measuring the entire evaluation stack, and most of that stack is invisible to the person reading the leaderboard.
Goodhart’s Law Has a Name Now
Benchmarks are targets. When a measure becomes a target, it ceases to be a good measure. Charles Goodhart said this in 1975. Fifty years later, the LLM community has produced a textbook illustration.
IFEval measures instruction-following: can the model produce output with exactly five paragraphs, include a specific keyword, or end with a particular sentence? TULU-3-8B, an open model, scores 82.4% on IFEval. Impressive. Then IFBench arrived with harder, more diverse instruction-following tasks. TULU-3-8B dropped to 28.9%.
From 82.4 to 28.9. Same capability claim. Same model. A different test that the model hadn’t been optimized against.
The problem is not TULU-3-8B. The problem is evaluation design. Models (and the teams training them) optimize for the benchmarks that exist. IFEval became the standard instruction-following benchmark, so models were tuned to excel at IFEval-style tasks. The capability improvement was real but narrow. It generalized to IFEval and almost nowhere else.
Organizations evaluating models based on IFEval scores were measuring benchmark fitness, not instruction-following capability. The distinction matters when the model encounters instructions that differ from the benchmark format.
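IFEval-style checks are programmatically verifiable, which is exactly what makes them easy to overfit. Here is a minimal sketch of such a checker (illustrative only, not IFEval's actual implementation): a model tuned to pass checks like these can ace the benchmark while failing instructions phrased any other way.

```python
def follows_instructions(text: str, n_paragraphs: int,
                         keyword: str, last_sentence: str) -> bool:
    """Verifiable instruction checks in the style of IFEval:
    paragraph count, keyword inclusion, and a fixed closing sentence."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return (
        len(paragraphs) == n_paragraphs
        and keyword.lower() in text.lower()
        and text.strip().endswith(last_sentence)
    )

response = "Intro about evals.\n\nBody with the keyword scaffold.\n\nThat is all."
print(follows_instructions(response, 3, "scaffold", "That is all."))
```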
The 70% Blindness Problem
Vision-language model benchmarks have an even more fundamental defect. DatBench, a quality auditing framework for VLM datasets, found that up to 70% of questions in some visual benchmarks are “blindly solvable.” The model can answer them correctly from the text alone, without looking at the image.
Think about what this means. A benchmark designed to measure visual reasoning is largely measuring text comprehension. A model could ignore every image and still pass most questions. The leaderboard is grading the wrong capability.
DatBench also found that 42% of spatial reasoning data was mislabeled or ambiguous. Nearly half. The benchmark asks “is the red ball to the left of the blue cube?” and the ground-truth answer is wrong, or the question is genuinely ambiguous.
These are not edge cases. They are the majority of the dataset. An enterprise evaluating VLMs for document understanding or visual inspection based on these benchmarks is making decisions on contaminated data. The scores tell you how well the model handles text questions paired with irrelevant images.
There is a circularity problem here too. DatBench uses frontier VLMs as quality judges to filter bad data points. The models being evaluated are cousins of the models doing the filtering. This is not disqualifying, but it means the quality floor is bounded by the blind spots of current frontier models.
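The blind-solvability test itself is cheap to run on your own benchmark. The sketch below assumes a hypothetical `ask_text_only` callable wrapping whatever VLM endpoint you use; the stand-in "model" here deterministically picks the longest option, a known text-only bias that real models also exploit.

```python
def blindly_solvable(question, choices, gold, ask_text_only):
    """Flag a visual QA item as blindly solvable when a text-only call
    (image withheld) still recovers the gold answer. `ask_text_only` is
    a hypothetical callable wrapping your VLM API."""
    return ask_text_only(question, choices) == gold

# Deterministic stand-in "model" for demonstration: guess the longest option.
def longest_option(question, choices):
    return max(choices, key=len)

item = ("What colour is the ball in the image?",
        ["red", "a weirdly specific shade of teal"],
        "a weirdly specific shade of teal")
print(blindly_solvable(*item, ask_text_only=longest_option))  # flagged
```

Run this over a whole dataset and the fraction of flagged items is a direct estimate of how much of your "visual" benchmark is really a text benchmark.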
IRT: Measurement Science Meets LLM Evaluation
Item Response Theory has been used in educational testing for decades. The SAT, GRE, and TOEFL all use it. The core idea: not all questions are equally informative. A question that every model gets right tells you nothing about relative capability. A question that every model gets wrong also tells you nothing. The most informative questions are the ones that discriminate between models of different ability levels.
IBM Research applied IRT to MMLU and produced tinyBenchmarks. One hundred carefully selected anchor items achieve less than 2% estimation error compared to the full 14,042-item benchmark. That is a 140x reduction in evaluation cost with nearly identical measurement precision.
For enterprises running model evaluations, this is not an incremental improvement. Full benchmark runs on frontier models consume significant compute. Running a 100-item evaluation instead of a 14,000-item evaluation changes the economics of continuous model assessment. You can evaluate weekly instead of quarterly. You can test against your own distribution of difficulty instead of accepting the benchmark’s distribution.
There is a limitation worth naming. IRT assumes stable latent traits. Human test-takers have a consistent ability level that IRT estimates. LLMs do not behave this way. They have discontinuous capability jumps across domains and high sensitivity to prompt formatting. A model that excels at virology questions formatted as multiple choice may fail the same questions presented as open-ended prompts. IRT’s statistical assumptions hold approximately, not perfectly, for LLMs.
This matters less than it might seem. IRT’s practical value (identifying which items are informative) survives even when the theoretical assumptions are violated. The 140x cost reduction is empirically validated, not just theoretically predicted.
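The machinery behind anchor-item selection is compact. Under the two-parameter logistic (2PL) model, each item has a discrimination `a` and difficulty `b`, and its Fisher information peaks for test-takers whose ability sits near `b`. A sketch with hypothetical item parameters (not IBM's actual item bank):

```python
import math

def p_correct(theta, a, b):
    # 2PL IRT: probability a model of ability theta answers correctly,
    # given item discrimination a and difficulty b.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p).
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item bank: (discrimination, difficulty) pairs.
items = [(0.5, -2.0), (1.8, 0.1), (1.2, 0.0), (0.9, 3.0)]
theta = 0.0  # ability level we want to measure around

# Rank items by how much they tell us about a model at this ability level.
ranked = sorted(items, key=lambda ab: -item_information(theta, *ab))
print(ranked[0])  # the high-discrimination item with difficulty near theta
```

Selecting the top-k items by information, rather than taking all 14,042, is the essence of the tinyBenchmarks reduction.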
Fluid Benchmarking: Evaluation That Adapts
The static benchmark model, where every model answers every question, is wasteful. Most questions in a benchmark provide no information about a given model because they are too easy or too hard. Fluid Benchmarking, built on Fisher information from IRT, selects questions adaptively based on what the evaluation has already learned about the model.
Computerized adaptive testing already works this way for the GRE. You answer a hard question correctly, and the next question is harder. You answer incorrectly, and the next question is easier. The test converges on your ability level efficiently, without wasting questions that are uninformative.
Applied to LLMs, Fluid Benchmarking means smaller, cheaper evaluations that are more precise. But it also means evaluation infrastructure becomes a system, not a script. You need item banks with calibrated difficulty parameters. You need an adaptive selection algorithm. You need infrastructure to track which items have been exposed to which models to prevent contamination.
Real engineering. And the kind most organizations have not invested in because benchmarks were supposed to be someone else’s problem. They are not.
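The adaptive loop itself fits in a page. The sketch below is in the spirit of Fluid Benchmarking, not the published algorithm: pick the unexposed item with maximal Fisher information at the current ability estimate, score it, and take one gradient step on the 2PL log-likelihood. The item bank, learning rate, and the deterministic stand-in "model" are all invented for illustration.

```python
import math

def p2pl(theta, a, b):
    # 2PL response curve: P(correct | ability theta, item (a, b)).
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    # Fisher information of the item at ability theta.
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def adaptive_eval(item_bank, answer_item, n_items=10, lr=0.5):
    """Minimal adaptive-testing loop (a sketch, not Fluid Benchmarking's
    exact procedure). item_bank: list of (a, b) parameters;
    answer_item(i) -> bool runs the model on item i."""
    theta, used = 0.0, set()
    for _ in range(n_items):
        candidates = [i for i in range(len(item_bank)) if i not in used]
        # Ask the unexposed question most informative at the current estimate.
        i = max(candidates, key=lambda j: info(theta, *item_bank[j]))
        used.add(i)
        a, b = item_bank[i]
        y = 1.0 if answer_item(i) else 0.0
        # One stochastic-gradient step on the 2PL log-likelihood.
        theta += lr * a * (y - p2pl(theta, a, b))
    return theta

# Demo: a deterministic stand-in "model" that solves items easier than 1.0.
bank = [(1.5, b / 4.0) for b in range(-8, 9)]  # difficulties -2.0 ... 2.0
est = adaptive_eval(bank, lambda i: bank[i][1] < 1.0, n_items=12)
print(round(est, 2))  # estimate hovers near the model's capability boundary
```

Note what the sketch already demands: calibrated (a, b) parameters per item and exposure tracking via `used`. Those are the item banks and contamination controls the paragraph above calls real engineering.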
SMART Filtering: Cleaning What You Have
Not every organization can build adaptive evaluation infrastructure from scratch. SMART filtering offers a more accessible improvement: clean your existing benchmark.
SMART filtering, which detects contaminated and redundant items, reduced a benchmark dataset by 48% while improving its correlation with ChatBot Arena rankings. Half the questions were redundant or contaminated, contributing noise rather than signal. Removing them produced a smaller, more accurate benchmark.
In practical terms: if you are using any public benchmark as an input to model selection, the raw score is probably less reliable than a filtered version of the same benchmark. As we explored in Evaluation-Driven Development, evaluation infrastructure is governance infrastructure. SMART filtering is one way to upgrade it without starting from zero.
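Even a crude redundancy pass captures the idea. The sketch below uses lexical Jaccard overlap as a stand-in for SMART's model-based redundancy and contamination signals (the threshold and example questions are invented): greedily keep a question only if it is not too close to one already kept.

```python
import re

def tokens(s: str) -> set[str]:
    # Lowercased alphanumeric tokens; punctuation is ignored.
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def filter_redundant(questions, threshold=0.8):
    """Greedy redundancy pass: a crude lexical stand-in for SMART's
    model-based filtering. Keeps a question only if its overlap with
    every already-kept question stays below the threshold."""
    kept = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

qs = [
    "What is the capital of France?",
    "What is the capital  of France ?",   # near-duplicate, filtered out
    "Which element has atomic number 6?",
]
print(filter_redundant(qs))
```

A production pass would add embedding similarity and contamination checks against training corpora, but even this level of hygiene removes the purely duplicated noise.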
What This Means for Enterprises
All of this research points to a clear operational conclusion. Organizations need to stop consuming benchmarks passively and start building evaluation infrastructure actively.
Scaffold governance matters more than model selection. If implementation choices cause 15% performance swings, then your deployment architecture, prompt templates, and integration code are the primary variables. Governing these is more impactful than choosing the model with the highest leaderboard score. Run your evaluation with your scaffold, on your data, measuring your outcomes.
Static benchmarks decay. Goodhart’s Law is not theoretical. IFEval to IFBench is a 54-point drop on the same capability claim. Any benchmark you adopt today will be gamed tomorrow. Build evaluation pipelines that refresh tasks, rotate difficulty, and detect when scores improve without corresponding capability gains.
IRT is practical, not just academic. A 140x cost reduction in evaluation is the difference between measuring occasionally and measuring continuously. Continuous measurement catches regressions that quarterly evaluations miss. If you are evaluating models at all, IRT-calibrated item selection should be your default approach.
Audit your benchmarks before trusting them. The DatBench finding (70% of VLM questions answerable without images) is an extreme case, but every benchmark has quality issues. MMLU has a 6.49% error rate. If your evaluation includes questions with wrong answers, your scores are measuring the wrong thing. SMART filtering or manual auditing is not optional.
The Infrastructure Thesis
Benchmarks are not broken. They are tools built for research comparison, pressed into service for enterprise procurement decisions they were never designed to support.
The research community is solving the measurement science problem. IRT, Fluid Benchmarking, SMART filtering, DatBench. These are real advances in how to measure model capability accurately and efficiently.
Enterprises face a different problem. Not "how do we build a better MMLU" but "how do we build evaluation infrastructure that tells us whether this model works in our context, at our scale, with our constraints." That requires treating evaluation as an engineering discipline with the same rigor as monitoring, testing, and security.
The 15% scaffold swing. The 54-point Goodhart drop. The 70% blind-solvability rate. The 6.49% error rate. These are not criticisms of benchmarks. They are specifications for evaluation infrastructure that an organization must build if it wants to make trustworthy decisions about AI.
The benchmark score on the leaderboard is an artifact of someone else’s scaffold, someone else’s item selection, and someone else’s quality threshold. The only score that matters is the one you measure yourself.
This analysis synthesizes Cameron R. Wolfe’s The Anatomy of an LLM Benchmark (March 2026), Epoch AI’s Why Benchmarking Is Hard (2025), and Anthropic’s Demystifying Evals for AI Agents (2025).
Victorino Group helps organizations build evaluation infrastructure that measures what actually matters. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.