Measuring AI in Software Development: What the Data Actually Shows
Here’s a number that should make you pause: developers using AI tools believed they were 20% faster. They were actually 19% slower.
This finding comes from a rigorous randomized controlled trial by METR, involving experienced open-source developers working on mature codebases. It’s not an outlier—it’s a signal that our intuitions about AI productivity are unreliable.
Meanwhile, Jellyfish reports that organizations with 80-100% AI adoption see 110%+ gains in pull request throughput. Both findings can be true. The difference is context, measurement, and understanding what you’re actually optimizing for.
The software development lifecycle is being redefined. Andrew Lau, CEO of Jellyfish, puts a timeline on it: within three years, the product development lifecycle (PDLC) will look completely different. In his telling, this is one of the biggest transformations since the invention of the internet.
But transformation without measurement is just hope with extra steps.
The Perception Gap
The METR study reveals something uncomfortable: developers don’t know how fast they’re working. Sixteen experienced developers completed 246 tasks across familiar codebases—projects they’d contributed to for over five years. With AI assistance, they reported feeling significantly more productive. The stopwatch disagreed.
Why the disconnect?
AI tools are genuinely helpful for certain tasks: boilerplate generation, syntax lookup, exploring unfamiliar APIs. This help feels like acceleration. But in mature codebases with established patterns, the context-switching cost of interacting with AI—crafting prompts, evaluating suggestions, correcting hallucinations—can exceed the time saved.
The lesson isn’t that AI tools don’t work. It’s that perceived productivity and actual productivity are different metrics, and conflating them leads to bad decisions.
The Trust Gap
Only 3% of developers say they “highly trust” AI-generated code. 46% say they distrust it. Yet 90% of teams now use AI tools, up from 61% a year ago.
This creates a strange dynamic: near-universal adoption paired with pervasive skepticism. Developers use the tools but verify everything. In some cases, this verification overhead negates the productivity gains.
The trust gap isn’t irrational. AI models hallucinate. They generate plausible-looking code that fails edge cases. They optimize for the immediate context while missing architectural implications. Skepticism is appropriate.
But skepticism without measurement leaves organizations guessing. Is the verification overhead worth it? For which tasks? At what adoption level?
The Leadership Gap
Here’s another number: 76% of executives believe their teams have embraced AI. Only 52% of engineers agree.
This perception gap matters because it shapes investment decisions. Leaders who believe adoption is high may underinvest in training and enablement. Leaders who don’t see the productivity gains their reports claim may cut tooling budgets prematurely.
Only 20% of engineering teams measure AI impact with actual engineering metrics. The rest rely on surveys, anecdotes, or nothing at all.
What the Data Actually Shows
Jellyfish tracked AI adoption across 600+ organizations throughout 2025. The findings are nuanced.
Adoption is accelerating. Code assistant usage grew from 49% to 69% between January and October. Review agent adoption nearly quadrupled, from 14.8% to 51.4%. This isn’t hype—it’s infrastructure spending and workflow changes.
Throughput gains are real but conditional. Organizations moving from 0% to 100% AI adoption saw 113% more pull requests per developer—from 1.36 to 2.9 weekly. But this correlation includes selection effects. Teams that adopt AI fully tend to be high-performing already.
Cycle time improvements are consistent. Median cycle time dropped 24%. This is harder to explain away with selection bias. Faster feedback loops benefit everyone.
Bug rates tell a complex story. Bug fix PRs increased from 7.5% to 9.5% of total PRs. Is that because AI introduces more bugs? Or because faster iteration exposes bugs sooner? Or because AI makes bug fixes easier to ship? The number alone doesn’t answer the question.
Tool retention reveals something. After 20 weeks, Copilot and Cursor retain 89% of users. Claude Code retains 81%. High retention suggests genuine utility, not just novelty.
The Three Adoption Walls
Lau identifies three barriers that separate organizations with marginal AI gains from those seeing 100%+ improvements:
Wall 1: Enablement and Training. Many organizations deploy AI tools without teaching developers how to use them effectively. Prompt engineering is a skill. So is knowing when not to use AI—when the context-switching cost exceeds the benefit.
Wall 2: Right Tools and Models. Not all AI coding assistants are equivalent. The model matters. The integration matters. The context window matters. Organizations that treat AI tools as interchangeable commodities miss optimization opportunities.
Wall 3: Redefining Roles and Processes. This is the hardest wall. When developers spend more time writing prompts than code, they’re doing specification work—traditionally the domain of product managers and architects. Role boundaries blur. Team ratios shift. Incentive structures need updating.
Organizations that clear all three walls see the 110%+ gains. Organizations stuck at any one of them risk the paradoxical slowdowns the METR study documented.
A Measurement Framework That Works
Lau proposes a three-layer framework that moves beyond vanity metrics:
Layer 1: Adoption. Track both quantitative usage (how often, which tools, which models) and qualitative engagement (are developers prompting effectively? are they using AI for appropriate tasks?). Adoption without effective usage is noise.
Layer 2: Throughput. Measure PR rate, cycle time, review latency—but at the systems level, not the individual level. Individual developer metrics create perverse incentives. Systems metrics reveal flow (see the sketch after this framework).
Layer 3: Outcomes. Connect development activity to business results: roadmap progress, defect rates, customer impact. A team shipping twice as many PRs that don’t move the product forward hasn’t improved.
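As a concrete illustration of Layer 2, here is a minimal sketch of system-level throughput measurement, assuming you can export pull request records from your Git host or analytics platform. The field names and sample data are hypothetical; adapt them to whatever your tooling actually provides.

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records exported from your Git host or analytics platform.
# Field names are illustrative, not any specific API's schema.
prs = [
    {"opened": datetime(2025, 9, 1, 9), "first_review": datetime(2025, 9, 1, 15),
     "merged": datetime(2025, 9, 2, 11), "author": "dev-a"},
    {"opened": datetime(2025, 9, 1, 10), "first_review": datetime(2025, 9, 3, 9),
     "merged": datetime(2025, 9, 4, 16), "author": "dev-b"},
    # ... more PRs over the measurement window
]

WINDOW_WEEKS = 1
developers = {pr["author"] for pr in prs}

# Layer 2 metrics, computed for the system rather than for individuals.
pr_rate = len(prs) / (len(developers) * WINDOW_WEEKS)  # PRs per developer per week
cycle_hours = [(pr["merged"] - pr["opened"]).total_seconds() / 3600 for pr in prs]
review_hours = [(pr["first_review"] - pr["opened"]).total_seconds() / 3600 for pr in prs]

print(f"PR throughput: {pr_rate:.2f} PRs per developer per week")
print(f"Median cycle time: {median(cycle_hours):.1f} hours")
print(f"Median review latency: {median(review_hours):.1f} hours")
```

The same aggregation, run over periods before and after an AI rollout, is what makes Layer 2 answerable with data rather than sentiment.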
Most organizations measure Layer 1 poorly and ignore Layers 2 and 3 entirely. This is why 80% of teams can’t answer basic questions about AI ROI.
The Systems Thinking Imperative
Here’s an insight that gets lost in the productivity discourse: the PDLC is a flow, not a collection of isolated tasks.
Accelerating code generation without accelerating code review creates a review bottleneck. Faster review without faster compliance approval creates a compliance bottleneck. Every local optimization that ignores system dynamics risks creating global slowdowns.
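A toy calculation makes the point. Treat the PDLC as a pipeline of stages, each with a weekly capacity: end-to-end throughput is capped by the slowest stage, so doubling coding capacity changes nothing if review is the constraint. The numbers below are invented purely for illustration.

```python
# Invented weekly capacities (work items per week) for each PDLC stage.
pipeline = {"code": 40, "review": 25, "test": 30, "compliance": 28}

def throughput(stages):
    """End-to-end flow is limited by the slowest stage."""
    bottleneck = min(stages, key=stages.get)
    return stages[bottleneck], bottleneck

before = throughput(pipeline)
after = throughput({**pipeline, "code": 80})  # AI doubles code generation capacity

print(f"Before: {before[0]}/week (bottleneck: {before[1]})")
print(f"After doubling coding: {after[0]}/week (bottleneck: {after[1]})")
# Both print 25/week with review as the bottleneck: the local speedup
# never reaches the customer until review capacity grows too.
```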
This is why AI tools are expanding beyond coding. Review agents grew from 14.8% to 51.4% adoption in 2025. Testing is next. Lau identifies it as a frontier: we need new ways to formalize “truth” and intent so AI can verify, not just generate.
The organizations that will win aren’t those with the fastest code generators. They’re those with the most coherent end-to-end development flow.
The Real Insight
Lau articulates something that’s been lurking in the background of every AI coding discussion:
“For decades, we thought coding was the hard part. It turns out describing what to build is harder.”
When a developer writes a prompt, they’re writing a specification. They’re doing design work, not implementation work. The cognitive labor hasn’t disappeared—it’s shifted upstream.
This has profound implications for hiring, training, and career development. The skills that matter are changing. Organizations that recognize this early will find the talent transition smoother. Those that don’t will find themselves with developers who are excellent at something AI now handles and undertrained for what AI needs from them.
Risk Acceleration
One more point that deserves attention: AI heightens compliance needs rather than reducing them.
Code moves faster. Review cycles compress. The window between “merged” and “deployed” shrinks. Every control that operated on a weekly cadence now needs to operate daily or continuously.
Organizations that see AI as a reason to relax governance controls have it backwards. AI is a reason to embed controls earlier, automate compliance checks, and build auditability into the pipeline from the start.
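What “embed controls earlier” can look like in practice: a small check that runs in CI and fails the build when a change lacks the evidence auditors will later ask for. This is a minimal sketch with hypothetical policy rules and field names, not a drop-in implementation.

```python
import sys

# Hypothetical merge metadata assembled by your pipeline (e.g., pulled from
# your Git host's API). Field names and the policy itself are illustrative.
change = {
    "approvals": 1,
    "linked_ticket": "PROJ-1234",
    "checks_passed": True,
    "ai_assisted": True,
    "human_reviewed": True,
}

violations = []
if change["approvals"] < 1:
    violations.append("at least one approving review required")
if not change["linked_ticket"]:
    violations.append("change must reference a tracked work item")
if not change["checks_passed"]:
    violations.append("required automated checks did not pass")
if change["ai_assisted"] and not change["human_reviewed"]:
    violations.append("AI-assisted changes require explicit human review")

if violations:
    print("Compliance gate failed:")
    for v in violations:
        print(f"  - {v}")
    sys.exit(1)  # fail the pipeline; the CI log becomes part of the audit trail
print("Compliance gate passed")
```

Checks like this run on every change, which is exactly the shift from weekly cadence to continuous control described above.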
The organizations moving fastest with AI are also the ones with the most robust governance frameworks. This isn’t coincidence. Governance enables speed by reducing rework, avoiding incidents, and building the organizational trust necessary for expanded AI autonomy.
What To Do With This
If you’re an engineering leader trying to make sense of AI’s impact, here’s a practical path forward:
Stop relying on developer sentiment. The METR study shows that perception doesn’t track reality. Instrument your systems. Measure actual cycle times, PR throughput, and deployment frequency. Compare periods with and without AI assistance for equivalent task types (a sketch of that comparison follows these recommendations).
Measure at the system level. Individual developer metrics create gaming and resentment. System metrics—end-to-end cycle time, flow efficiency, bottleneck identification—reveal where AI helps and where it creates new constraints.
Invest in training, not just tools. The adoption wall that matters most is the skills gap. Developers need to learn when to use AI, how to prompt effectively, and how to verify outputs efficiently. Budget for this.
Redefine roles proactively. If your developers are spending significant time on specification work, acknowledge it. Update job descriptions, hiring criteria, and career ladders. The transition is happening whether you manage it or not.
Connect to outcomes. Throughput gains that don’t translate to customer value are vanity metrics. Build the measurement infrastructure that connects development activity to business results.
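Here is a minimal sketch of the comparison suggested above: median cycle time for AI-assisted versus unassisted changes, grouped by task type. The records and the labeling field are hypothetical; in practice you need a reliable way to tag which changes used AI and to control for task difficulty, and small samples prove nothing.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: cycle time in hours, task type, and whether AI was used.
changes = [
    {"task_type": "bugfix",  "ai_assisted": True,  "cycle_hours": 6.0},
    {"task_type": "bugfix",  "ai_assisted": False, "cycle_hours": 9.5},
    {"task_type": "feature", "ai_assisted": True,  "cycle_hours": 30.0},
    {"task_type": "feature", "ai_assisted": False, "cycle_hours": 26.0},
    # ... many more records over a meaningful window
]

groups = defaultdict(list)
for c in changes:
    groups[(c["task_type"], c["ai_assisted"])].append(c["cycle_hours"])

for task_type in sorted({c["task_type"] for c in changes}):
    with_ai = median(groups[(task_type, True)])
    without_ai = median(groups[(task_type, False)])
    print(f"{task_type}: {with_ai:.1f} h with AI vs {without_ai:.1f} h without")
```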
The productivity gains from AI in software development are real. So are the perception gaps, the trust issues, and the measurement failures that prevent organizations from capturing them.
The organizations that will benefit most aren’t those with the most AI adoption. They’re those with the clearest understanding of what AI changes, what it doesn’t, and how to measure the difference.
Sources: McKinsey Technology Interview with Jellyfish CEO Andrew Lau, METR RCT Study, Jellyfish 2025 AI Metrics in Review
If this resonates, let's talk
We help companies implement AI without losing control.
Schedule a Conversation