McKinsey Measured the Wrong Thing

Thiago Victorino
8 min read

In November 2025, McKinsey surveyed 300 senior executives about AI’s impact on software development. The headline: organizations report 16-45% improvements in productivity, quality, and delivery speed. The methodology: asking executives what they believe is happening.

That is the entire study. Three hundred people answered a questionnaire about their perceptions. McKinsey packaged those perceptions as findings.

Compare that to what controlled measurement actually shows.

What the Numbers Say When Nobody Is Asking Executives

METR, a research organization focused on AI evaluation, ran the only randomized controlled trial of AI coding assistance published to date. Experienced developers completed real tasks with and without AI tools. Result: developers were 19% slower with AI assistance. They believed they were 24% faster.

The perception mismatch was not small. It was directionally wrong. Developers did not slightly overestimate their gains. They reported speed improvements while experiencing slowdowns.

The NBER’s February 2026 survey of 6,000 executives across industries found that more than 80% report zero measurable productivity gains from AI adoption. Not small gains. Zero. This is a sample twenty times larger than McKinsey’s, asking about actual measured outcomes rather than perceived improvements.

LinearB’s 2026 Engineering Benchmarks analyzed 8.1 million pull requests across 4,800 engineering teams. AI-generated pull requests have a 32.7% acceptance rate. Human-written pull requests: 84.4%. Two out of three AI pull requests get rejected. That is measurement at scale, not a survey.

Sonar’s 2026 State of Code report surveyed 1,149 developers. 96% do not fully trust AI-generated code. Only 48% always verify it before committing. The code is being produced, pushed, and deployed without the verification that developers themselves say it needs.

What McKinsey Gets Right

Before deconstructing the methodology, credit where it is due. The McKinsey report identifies five organizational factors that separate companies reporting higher AI impact: upskilling programs, measurement practices, change management, end-to-end implementation, and AI-native role design. These are directionally sound.

The report quotes Sonar CEO Tariq Shaukat: “Measure outcomes, not adoption.” That is exactly correct. It is also the opposite of what McKinsey’s own survey does, which measures executive perception of outcomes rather than the outcomes themselves.

The framing matters too. McKinsey positions AI developer productivity as an “operating model” problem, not a tool problem. Organizations that just hand developers AI tools and wait see minimal returns. Organizations that restructure workflows, roles, and measurement see more. This validates what governed implementation looks like in practice.

And Artificial Analysis data shows the underlying technology genuinely improving. Benchmark performance roughly doubled in a year. The tools are getting better. The question is whether organizations can tell the difference between better tools and better results.

The Self-Report Problem

McKinsey’s methodology has a specific failure mode. Executives who championed AI adoption are asked whether AI adoption is working. The career incentive is obvious. Nobody who lobbied the board for a seven-figure AI investment reports back that it produced nothing. The survey design selects for optimism.

This is not speculation. It is precisely what METR measured. Developers using AI tools reported feeling faster. The stopwatch said otherwise. If individual developers cannot accurately assess their own productivity with AI, why would their executives do better from several organizational layers away?

The report’s showcase example reveals the bias in action. Cursor, an AI-native code editor valued at $29.3 billion, is presented as a productivity success story. The company was built from the ground up around AI workflows. Using Cursor to represent enterprise AI adoption is like using Formula 1 pit crews to benchmark automotive manufacturing. The capability is real. The transferability is not.

The Missing Risk Analysis

McKinsey’s report contains no mention of AI code quality risks. None. In a report about AI’s impact on software development, the word “vulnerability” does not appear.

Veracode’s 2025 analysis found that 40-48% of AI-generated code contains security vulnerabilities, across over 100 LLMs tested. Stack Overflow’s 2025 survey of 49,000 developers found that 66% spend more time fixing “almost-right” AI code than they would have spent writing it from scratch. These are not edge findings from obscure sources. They are the largest surveys in the industry, and McKinsey either missed them or chose not to include them.

A report that measures AI developer productivity without measuring AI code defect rates is measuring half the equation. It is like reporting a factory’s output without counting the returns.

The Commercial Circularity

McKinsey’s five recommendations for capturing AI value are: invest in upskilling, build measurement frameworks, implement change management, deploy end-to-end transformation, and create AI-native roles. These are McKinsey’s five consulting service lines, listed in order.

This is not unique to McKinsey. Every consulting firm’s research conveniently identifies problems that require the firm’s services to solve. But it is worth naming, because the research is cited as independent evidence when it functions as a sales document. The survey finds that organizations need exactly what McKinsey sells. The circularity should inform how much weight you give the conclusions.

The Perception Mismatch Is the Governance Problem

Here is the argument that McKinsey’s data accidentally supports better than any controlled study could.

If executives perceive 16-45% improvements while controlled measurement shows neutral-to-negative results, the problem is not that AI fails to deliver value. The problem is that organizations cannot distinguish AI enthusiasm from AI value. They lack the measurement infrastructure to know.

When an organization cannot tell whether a technology is helping or hurting, every decision about that technology is a guess. Scaling decisions. Hiring decisions. Investment decisions. Security decisions. All built on perception data that the only randomized controlled trial in the field directly contradicts.

That is not a measurement error. It is a governance vacuum. The absence of measurement infrastructure means the absence of accountability, the absence of course correction, and the absence of any mechanism to convert AI potential into verified AI outcomes.

What Measurement Infrastructure Requires

Four capabilities separate organizations that know what AI is doing from those that believe what AI is doing.

Outcome measurement, not adoption tracking. Stop counting how many developers use AI tools. Start counting AI-generated PR acceptance rates, defect rates by code origin, time-to-detection for AI-introduced bugs, and verification ratios. LinearB’s 32.7% acceptance rate is the kind of number that should be on every engineering dashboard. Most organizations cannot produce it.
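A minimal sketch of what that dashboard computation looks like. The PR record shape here (`origin`, `accepted`, `defects_found`) is hypothetical; in practice these fields would come from your code review system and bug tracker, with defects traced back to the PR that introduced them.

```python
from collections import defaultdict

def outcome_metrics(prs):
    """Compute acceptance rate and defects-per-PR, split by code origin.

    Each PR is a dict with illustrative fields: 'origin' ('ai' or 'human'),
    'accepted' (bool), and 'defects_found' (bugs later traced to this PR).
    """
    stats = defaultdict(lambda: {"total": 0, "accepted": 0, "defects": 0})
    for pr in prs:
        s = stats[pr["origin"]]
        s["total"] += 1
        s["accepted"] += pr["accepted"]
        s["defects"] += pr["defects_found"]
    return {
        origin: {
            "acceptance_rate": s["accepted"] / s["total"],
            "defects_per_pr": s["defects"] / s["total"],
        }
        for origin, s in stats.items()
    }

# Toy data: 1 of 3 AI PRs accepted, 1 of 1 human PR accepted.
prs = [
    {"origin": "ai", "accepted": True, "defects_found": 1},
    {"origin": "ai", "accepted": False, "defects_found": 0},
    {"origin": "ai", "accepted": False, "defects_found": 0},
    {"origin": "human", "accepted": True, "defects_found": 0},
]
metrics = outcome_metrics(prs)
```

The hard part is not the arithmetic. It is tagging code origin reliably at commit time, which most organizations have never instrumented.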

Quality gates calibrated to AI failure patterns. Veracode documented the vulnerability profiles. METR documented the speed-accuracy tradeoff. Stack Overflow documented the “almost right” phenomenon. These failure modes are known and characterizable. Automated verification can catch them, but only if the verification infrastructure exists and is calibrated to look for them.
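One way to express that calibration is a gate that applies stricter checks to AI-origin PRs than to human ones. This is a sketch with made-up field names and illustrative thresholds, not a vendor's recommended policy:

```python
def quality_gate(pr):
    """Pass/fail a PR with thresholds calibrated to known AI failure modes.

    'pr' is a hypothetical dict of scan and review results; the thresholds
    are illustrative only.
    """
    checks = []
    if pr["origin"] == "ai":
        # Elevated vulnerability rates in AI code: zero tolerance for
        # open security findings on AI-origin PRs.
        checks.append(("security_findings", pr["security_findings"] == 0))
        # The "almost right" failure mode: require human review AND
        # passing tests, not either alone.
        checks.append(("human_reviewed", pr["human_reviewed"]))
        checks.append(("tests_pass", pr["tests_pass"]))
    else:
        checks.append(("security_findings", pr["security_findings"] <= 2))
        checks.append(("tests_pass", pr["tests_pass"]))
    failed = [name for name, ok in checks if not ok]
    return (len(failed) == 0, failed)

# An AI-origin PR with one open security finding is blocked.
ok, failed = quality_gate(
    {"origin": "ai", "security_findings": 1,
     "human_reviewed": True, "tests_pass": True}
)
```

The design choice worth noting: the gate branches on code origin. A single uniform threshold cannot be calibrated to failure patterns that differ by origin.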

Verification scaling that matches adoption scaling. The pilot team had ten developers and careful oversight. The rollout has two thousand developers and the same oversight budget. If verification does not scale at the same rate as AI code generation, defect rates compound silently. McKinsey’s own finding that only “end-to-end” implementations show results supports this: partial adoption without full governance produces partial results at best.
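The arithmetic of that dilution is worth making explicit. With made-up but plausible numbers (10 pilot developers versus 2,000 at rollout, the same fixed review capacity), coverage collapses by a factor of 200:

```python
def verification_coverage(devs, prs_per_dev, review_capacity):
    """Fraction of generated PRs the oversight budget can actually verify."""
    generated = devs * prs_per_dev
    return min(1.0, review_capacity / generated)

# Pilot: 10 developers, oversight sized to cover all of their output.
pilot = verification_coverage(devs=10, prs_per_dev=5, review_capacity=50)
# Rollout: 2,000 developers, same oversight budget.
rollout = verification_coverage(devs=2000, prs_per_dev=5, review_capacity=50)
# pilot covers 100% of PRs; rollout covers 0.5%.
```

Every unverified PR at rollout carries the defect rates the controlled studies measured, which is why the compounding is silent: nothing in the adoption metrics changes when coverage drops.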

Independent measurement of productivity claims. When a vendor says their tool improves productivity by 40%, ask for the study design. Self-report surveys of 300 executives are not evidence. Randomized controlled trials are. The difference matters because investment decisions follow.

The Forecast

McKinsey is right that AI developer tools will keep improving. Artificial Analysis data confirms the trajectory. The models are getting better, the tooling is maturing, and developer adoption will continue to rise.

None of that changes the core problem. Better tools without measurement infrastructure means faster production of unverified code. The organizations that build governance infrastructure now, while the tools are still improving, will be the ones positioned to capture actual value when the tools mature. The organizations running on executive perception will not know whether they captured value or not.

They will just believe they did. That is what the McKinsey survey measured.


Sources

  • McKinsey. “How AI Is Boosting Developer Productivity.” November 2025. 300 senior executives surveyed on perceived AI impact. Reported 16-45% improvements in productivity, quality, and delivery speed.
  • METR. 2025 randomized controlled trial. Experienced developers 19% slower with AI assistance. Perceived 24% faster.
  • NBER. February 2026 survey. 6,000 executives. 80%+ report zero measurable AI productivity gains.
  • LinearB. “2026 Engineering Benchmarks.” 8.1M pull requests, 4,800 teams. AI PR acceptance rate: 32.7% vs. manual: 84.4%.
  • Sonar. “State of Code 2026.” 1,149 developers. 96% do not fully trust AI output. 48% always verify.
  • Veracode. 2025 analysis. 40-48% of AI-generated code contains security vulnerabilities across 100+ LLMs.
  • Stack Overflow. “2025 Developer Survey.” 49,000+ respondents. 66% cite “almost right” problem.
  • Artificial Analysis. AI coding benchmark tracker. Performance roughly doubled in one year.

Victorino Group helps organizations build governance infrastructure for AI-generated code. If you are making investment decisions based on perception data, that is the problem we solve. Reach out at contact@victorinollc.com or visit www.victorinollc.com.
