Leaderboards Measure Pass Rate. They Hide Security Issues.
The leaderboards everyone cites measure one thing: pass rate. Did the generated code satisfy the test? That number tells you the model can produce something functionally correct. It tells you nothing about whether enterprises can ship it.
Sonar ran the benchmark that fills in the rest. Prasenjit Sarkar presented the results at AI Engineer in May 2026: 4,444 Java assignments run across more than 53 models, every output scored by SonarQube for security, complexity, and maintainability. The full dataset is open at sonar.com/leaderboard. One caveat up front, and it matters: this is Sonar’s own benchmark, Java only, scored by Sonar’s own analyzer. Read the numbers as Sonar’s published evidence, not as a universal law. Even with that frame, the shape of the data is hard to ignore.
The Number on the Leaderboard Is the Wrong Number
Take Gemini 3.1 Pro High. On pass rate it looks excellent: 84.17%. That is the figure a buyer sees and trusts. Underneath, the same runs produced roughly 614 bugs per million lines of code, around 210 security issues per million lines, and a cyclomatic complexity of 234 across 307k lines generated for the set. The rank is real. So are the defects sitting under it.
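To make the per-million-line rates concrete, the sketch below scales them to the 307k lines reported for this model. It assumes the rates apply uniformly across the generated code, which the source does not state, and the function name is ours.

```python
def expected_count(rate_per_million: float, lines: int) -> float:
    """Scale a per-million-lines rate to an absolute count for `lines` lines."""
    return rate_per_million * lines / 1_000_000


LINES = 307_000  # lines generated for the set, as reported in the article

bugs = expected_count(614, LINES)      # ~614 bugs per million lines
security = expected_count(210, LINES)  # ~210 security issues per million lines

print(f"~{bugs:.0f} bugs, ~{security:.0f} security issues")
```

Under that assumption, an 84.17% pass rate sits on top of roughly 188 bugs and 64 security issues in a single benchmark run.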
This is the core problem with pass-rate leaderboards. They reward the answer and ignore the artifact. An enterprise does not pay for a correct answer. It pays for code that a team will read, extend, secure, and operate for years. Pass rate measures the first ten minutes. Security, complexity, and maintainability measure the next ten years.
We have argued before that producing code got cheap while verifying it did not. Sonar’s dataset is third-party evidence, model by model, that the bill comes due in security and complexity, not in functional correctness.
Verbosity Is Rising With Every Model Generation
The clearest trend in the data is volume. Newer models do not just write better code. They write more of it.
GPT-4o produced under 250k lines for the full assignment set. GPT 5.4, on the same set, produced 1.2 million lines. That is roughly a five-fold increase in output for identical tasks. Claude Sonnet 4.6 landed in between at 627k lines, and carried the highest security-issue rate in the comparison at around 300 issues per million lines.
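The arithmetic behind those figures can be sketched in a few lines. The line counts are the ones quoted above. GPT-4o's 250k is an upper bound ("under 250k"), so the ratios below are lower bounds, and the absolute issue count assumes the quoted rate holds across the whole output.

```python
# Line counts reported in the article; GPT-4o's figure is an upper bound.
lines = {
    "GPT-4o": 250_000,
    "Claude Sonnet 4.6": 627_000,
    "GPT 5.4": 1_200_000,
}

baseline = lines["GPT-4o"]
for model, count in lines.items():
    print(f"{model}: at least {count / baseline:.1f}x the GPT-4o volume")

# Absolute security issues implied by ~300 issues per million lines.
sonnet_issues = 300 * lines["Claude Sonnet 4.6"] / 1_000_000
print(f"Claude Sonnet 4.6: ~{sonnet_issues:.0f} security issues in the set")
```

A rate per million lines and a line count have to be read together: a higher rate on a larger output compounds into more issues for a reviewer to find.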
More code is not a feature. Every line is a line someone has to review, a line that can hide a flaw, a line that raises the cost of the next change. When verbosity climbs with each model generation, the verification load climbs with it. The model that writes 1.2 million lines has handed your team 1.2 million lines to trust.
The Defects Are Getting Harder to See, Not Rarer
The most uncomfortable finding is not in any single rate. It is in the texture of the defects.
As models mature, the bugs and vulnerabilities they produce get finer. Earlier models failed loudly: code that did not compile, logic that was obviously wrong, the kind of error a reviewer catches in seconds. Newer models fail quietly. The code compiles, passes the test, reads cleanly, and carries a subtle injection path or a resource leak that a human reviewer will scroll right past.
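A small illustration of what "failing quietly" can look like. This example is ours, not drawn from Sonar's dataset, and it uses Python with the standard `sqlite3` module for brevity even though the benchmark is Java. Both functions pass the same functional test; only one of them is safe.

```python
import sqlite3


def find_user_quiet(conn: sqlite3.Connection, name: str) -> list:
    # Reads cleanly and passes the happy-path test, but interpolates
    # untrusted input directly into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the driver treats `name` strictly as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# The kind of check a pass-rate benchmark runs: both versions pass.
assert find_user_quiet(conn, "alice") == [(1, "alice")]
assert find_user_safe(conn, "alice") == [(1, "alice")]

# A crafted input separates them.
payload = "x' OR '1'='1"
leaked = find_user_quiet(conn, payload)  # the WHERE clause becomes always true
blocked = find_user_safe(conn, payload)  # no user has this literal name

print(len(leaked), len(blocked))
```

A pass-rate score cannot tell these two functions apart. A reviewer skimming a large diff may not either.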
This inverts the comfortable assumption that better models reduce verification work. They do the opposite. A loud bug is cheap to catch. A quiet vulnerability buried in clean, plausible code is expensive to catch, and the price of missing it shows up in production. Verification difficulty is rising, not falling, precisely because the models are improving.
Verification Has to Move Left, to Seconds
If the defects are finer and the volume is higher, the old loop breaks. Catching issues in CI, minutes after the code is written, means the developer has already moved on, the context is gone, and the fix is a context-switch tax. Multiply that across thousands of generated lines per day and CI becomes the new bottleneck.
Sonar’s proposed answer is the ACDC framework: guide, verify, solve. The part that matters operationally is the timing. SonarQube’s agentic analysis runs in one to five seconds before the commit, against one to five minutes in CI. That difference, seconds at the keyboard versus minutes in the pipeline, is the difference between a fix that happens in flow and a fix that gets deferred.
The remediation layer is the other half. A remediation agent makes exactly one fix per issue, then re-analyzes and recompiles. If the fix regresses anything, it is discarded. That discipline is the point: an agent that generates fixes without a verification gate just adds more unverified code to the pile. The gate, not the generation, is what makes it trustworthy.
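The gate described above can be sketched as a loop. This is a minimal illustration under our own assumptions, not SonarQube's API: `Analysis`, `remediate`, `toy_analyze`, and `toy_fix` are hypothetical names, and "regression" here means the candidate fails to compile or introduces an issue that was not present before.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Analysis:
    compiles: bool
    issues: FrozenSet[str]


def remediate(
    code: str,
    analyze: Callable[[str], Analysis],
    propose_fix: Callable[[str, str], str],
) -> Tuple[str, Analysis]:
    """Try one candidate fix per issue; keep it only if it passes the gate."""
    current = analyze(code)
    for issue in sorted(current.issues):
        if issue not in current.issues:
            continue  # already resolved by an earlier accepted fix
        candidate = propose_fix(code, issue)
        result = analyze(candidate)  # re-analyze and recompile the candidate
        regressed = (not result.compiles) or bool(result.issues - current.issues)
        if issue not in result.issues and not regressed:
            code, current = candidate, result  # accept the fix
        # otherwise the candidate is discarded and `code` stays unchanged
    return code, current


# Toy stand-ins for an analyzer and a fix generator (hypothetical).
def toy_analyze(code: str) -> Analysis:
    found = {tag for tag in ("INJECTION", "LEAK", "NULL_DEREF") if tag in code}
    return Analysis(compiles="SYNTAX_ERROR" not in code, issues=frozenset(found))


def toy_fix(code: str, issue: str) -> str:
    if issue == "LEAK":
        # A bad fix: removes the leak but introduces a new issue.
        return code.replace("LEAK", "NULL_DEREF")
    return code.replace(issue, "ok")


fixed_code, final = remediate("INJECTION; LEAK", toy_analyze, toy_fix)
print(fixed_code, sorted(final.issues))
```

In the toy run the injection fix is accepted, while the leak fix is discarded because it trades one issue for another. The leak stays open and reported rather than silently replaced.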
Do This Now
Stop ranking models by pass rate alone. Before you standardize a coding model for your team, pull its security-issues-per-million-lines and complexity numbers from Sonar’s open leaderboard and weigh them against the pass rate. The model that wins your benchmark may be the one quietly handing your team the most to clean up.
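One way to weigh those numbers is a simple threshold gate, sketched below. The metrics for Gemini 3.1 Pro High are the ones quoted earlier in this article. The thresholds are hypothetical placeholders, not recommendations from Sonar; set them from your own risk tolerance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelMetrics:
    name: str
    pass_rate: float          # percent
    bugs_per_mloc: float      # bugs per million lines
    security_per_mloc: float  # security issues per million lines


def gate_failures(
    m: ModelMetrics,
    min_pass_rate: float,
    max_bugs: float,
    max_security: float,
) -> List[str]:
    """Return the criteria a model fails; an empty list means it clears the gate."""
    failures = []
    if m.pass_rate < min_pass_rate:
        failures.append("pass rate")
    if m.bugs_per_mloc > max_bugs:
        failures.append("bugs per million lines")
    if m.security_per_mloc > max_security:
        failures.append("security issues per million lines")
    return failures


gemini = ModelMetrics("Gemini 3.1 Pro High", 84.17, 614, 210)

# Hypothetical thresholds, chosen only to illustrate the gate.
failures = gate_failures(gemini, min_pass_rate=80.0, max_bugs=500, max_security=150)
print(failures)
```

With these placeholder thresholds, the model clears the pass-rate bar and fails the other two, which is exactly the gap a pass-rate-only ranking hides.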
Then move your verification left. If your only quality gate runs in CI, you are catching finer defects later than you can afford to. Put analysis in the pre-commit loop, where it runs in seconds and the developer still has the context to act. Generation speed is no longer your constraint. Verification speed is.
This analysis synthesizes Can LLMs generate Enterprise Quality Code? (Sonar, May 2026).
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.