When AI Measurement Breaks: METR's Dependency Problem
In January, I wrote about METR’s finding that experienced developers were 19% slower with AI tools. The study mattered because it was rigorous: randomized, controlled, and run on real codebases with real contributors.
METR just published an update. The new results are interesting. But the real story is what happened to the experiment itself.
The Experiment That Ate Itself
METR expanded the original study from 10 to 57 developers. The returning 10 showed -18% speedup (confidence interval: -38% to +9%). The new 47 showed -4% speedup (CI: -15% to +9%). Neither result is statistically significant, and METR acknowledges selection bias likely suppresses the true benefit of AI tools.
That acknowledgment matters less than the reason behind it.
Thirty to fifty percent of developers avoided submitting tasks where AI provides the most uplift. They cherry-picked tasks where working with or without AI felt comparable. One developer admitted it directly: “I found I am actually heavily biased sampling the issues … I avoid” the ones where AI saves the most time.
The experiment was supposed to measure AI’s impact on productivity. Instead, it revealed that developers can no longer cleanly separate AI-assisted work from non-AI work. The two have fused.
Recruitment Collapse
METR cut pay from $150/hour to $50/hour. That alone would skew the participant pool. But the deeper problem is that developers increasingly refuse to participate in the control condition.
“I’m torn. I’d like to help provide updated data but also I really like using AI!”
“My head’s going to explode if I try to do too much the old fashioned way.”
These are not casual complaints. These are experienced open-source contributors saying they cannot (or will not) do their work without AI tools. Some developers in the study failed to complete their AI-disallowed tasks at all.
METR is not losing participants to indifference. They are losing participants to dependency.
Time Itself Breaks
Here is the subtlest problem. Agentic AI tools change the nature of developer time.
When a developer kicks off an AI agent to handle a task, they do other work while the agent runs. How do you measure that? The developer spent 10 minutes on the task. The agent spent 40 minutes. The calendar says 40 minutes passed. The developer was productive on something else for 30 of those minutes. Traditional time measurement assumes one person, one task, one clock. Agentic workflows violate all three assumptions.
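To make the accounting concrete, here is a minimal sketch of how the same 40 minutes look under two different clocks. The interval data and task names are hypothetical illustrations, not METR's instrumentation:

```python
# Minimal sketch: why a single clock misleads for agentic workflows.
# All interval data and task names are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float  # minutes from task kickoff
    end: float
    actor: str    # "developer" or "agent"
    task: str

intervals = [
    Interval(0, 10, "developer", "issue-A"),   # writes the prompt, reviews the agent's plan
    Interval(0, 40, "agent", "issue-A"),       # agent works on the task autonomously
    Interval(10, 40, "developer", "issue-B"),  # developer switches to unrelated work meanwhile
]

# Wall-clock duration of issue-A, from first to last activity on it.
wall_clock = (max(i.end for i in intervals if i.task == "issue-A")
              - min(i.start for i in intervals if i.task == "issue-A"))

# Human attention actually spent on issue-A.
dev_attention = sum(i.end - i.start for i in intervals
                    if i.actor == "developer" and i.task == "issue-A")

print(f"wall clock for issue-A: {wall_clock:.0f} min")            # 40
print(f"developer attention on issue-A: {dev_attention:.0f} min")  # 10
# A per-task stopwatch reports 40 minutes; the human cost was 10, and 30 of
# those minutes produced value on issue-B. No single number captures both facts.
```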
METR flags this explicitly. Their measurement infrastructure was built for a world where developers do one thing at a time. That world is disappearing. Roughly 4% of GitHub commits are now authored by Claude Code, and that number is growing monthly.
Why This Is a Governance Signal
Organizations typically measure AI impact through controlled comparison: team A uses AI, team B does not. Before and after. With and without.
METR, a research organization with explicit methodological rigor and a funded mandate to get this right, cannot maintain a clean control group. If they cannot do it in a research setting, your organization cannot do it in production.
This is not a measurement challenge. It is a measurement failure. The instrument is broken.
In our earlier analysis, we identified the perception mismatch: developers believed they were faster when they were actually slower. That finding assumed you could still measure “actually slower.” METR’s update shows that assumption eroding. You cannot measure something when participants refuse the control condition, cherry-pick favorable tasks, and multitask in ways that make time tracking meaningless.
The Dependency Gradient
What METR documented is a dependency gradient, not a binary state. Some developers find AI helpful but optional. Others find it so embedded in their workflow that removing it feels like removing their IDE.
This gradient matters for governance because it determines reversibility. If your organization decided tomorrow to revoke AI tool access (for security, compliance, or cost reasons), could your developers still ship at an acceptable pace?
METR’s evidence suggests the answer is increasingly no. Not because the tools are irreplaceable in theory, but because developers have restructured their working patterns around them. They select different tasks. They approach problems differently. They have, in some cases, stopped maintaining the skills they would need to work without AI.
The parallel is not to a tool. It is to infrastructure. You can swap out a hammer. You cannot casually swap out your build system.
What This Means for Measurement
If pre-AI measurement frameworks are breaking, what replaces them? Three shifts are necessary.
Measure outcomes, not activity. Stop trying to isolate AI’s contribution to individual tasks. Instead, measure what reaches production, what customers use, what moves the business. The question “how much faster does AI make developers?” may be unanswerable. The question “are we shipping better software?” is not.
Treat AI dependency as a risk register item. METR’s recruitment collapse is a preview. If your senior engineers cannot function without a specific vendor’s AI tools, that vendor has leverage over your operations. Map the dependency. Quantify the switching cost. Build contingency plans. (A sketch of what such an entry might look like follows these three shifts.)
Monitor the skill atrophy. Developers who always delegate certain tasks to AI stop practicing those tasks. Code review skills decay when AI handles first-pass review. Debugging skills decay when AI generates the fix. This is not hypothetical. METR’s participants demonstrated it: some could not complete tasks without AI that they presumably could have completed two years ago.
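For the dependency-mapping shift, here is a minimal sketch of what an AI-dependency entry in a risk register might capture. Every field name, vendor, number, and threshold is a hypothetical illustration, not a prescribed schema:

```python
# Minimal sketch of an AI-dependency entry in a risk register.
# All field names, values, and the vendor are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class AIDependencyRisk:
    vendor: str
    capability: str                # what the tool is relied on for
    teams_affected: list[str]
    est_velocity_loss_pct: float   # expected slowdown if access is revoked
    switching_cost_weeks: float    # estimated time to migrate or retrain
    contingency: str               # documented fallback plan
    last_reviewed: str

risk = AIDependencyRisk(
    vendor="ExampleAI",            # hypothetical vendor
    capability="first-pass code review and agentic bug fixes",
    teams_affected=["platform", "payments"],
    est_velocity_loss_pct=35.0,
    switching_cost_weeks=8.0,
    contingency="manual review rota; quarterly no-AI drills",
    last_reviewed="2026-02-01",
)

# A register like this turns "our developers depend on AI" from an anecdote
# into something that can be tracked, priced, and escalated.
print(f"{risk.vendor}: ~{risk.est_velocity_loss_pct:.0f}% est. velocity loss, "
      f"~{risk.switching_cost_weeks:.0f} weeks to switch")
```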
The Uncomfortable Question
Every organization running AI productivity studies is measuring a moving target. The baseline keeps shifting because developers keep adapting. By the time you have statistically significant results, the workflow has changed again.
METR’s honest response was to redesign the entire experiment. They are moving toward within-developer comparisons (same person, with and without AI) and time-boxed tasks that control for the multitasking problem.
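A within-developer design is also straightforward to analyze, because each person serves as their own control. A minimal sketch of the comparison, with made-up completion times and a standard paired test rather than METR's actual analysis:

```python
# Minimal sketch of a within-developer comparison on matched, time-boxed tasks.
# The completion times below are invented for illustration; this is a generic
# paired analysis, not METR's methodology.
from scipy import stats

# Completion times in minutes per developer: (with_ai, without_ai)
paired_times = [
    (42, 55),
    (38, 31),
    (60, 72),
    (45, 50),
    (50, 48),
]

with_ai = [t[0] for t in paired_times]
without_ai = [t[1] for t in paired_times]

# Paired test: comparing each developer against themselves removes the
# between-developer variance that swamps small-sample group comparisons.
t_stat, p_value = stats.ttest_rel(with_ai, without_ai)
mean_diff = sum(w - wo for w, wo in paired_times) / len(paired_times)

print(f"mean difference (with - without): {mean_diff:+.1f} min, p = {p_value:.3f}")
```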
Most organizations will not have METR’s methodological sophistication. They will rely on before/after comparisons that cannot account for the dependency effects METR documented. They will draw conclusions from data that is structurally compromised.
The organizations that handle this well will be the ones that stop asking “how much does AI help?” and start asking “what happens when AI is unavailable?” The first question is becoming unanswerable. The second is becoming urgent.
Sources: METR, “We Are Changing Our Developer Productivity Experiment Design,” February 2026. See also our earlier analysis: Measuring AI in Software Development: What the Data Actually Shows.