Greg Wilson Just Gave Us an Academic Spine for AI Productivity Skepticism
Every measurement-skepticism essay we have published about AI coding productivity in the last six months has carried the same uncomfortable footnote: most of the numbers we were rebutting came from vendor blogs, and most of the numbers we were citing in rebuttal came from a small handful of studies that everyone keeps quoting because there is not much else to quote. The literature was real. It was just scattered. No one had assembled it.
Greg Wilson did that on May 20, 2026. Twelve Ways to Be Wrong About AI-Assisted Coding is the peer-reviewed spine the productivity debate has been missing. Each of the twelve failure modes Wilson catalogues comes with at least one academic source behind it, and the citations are mostly from 2025 and 2026, which means the field has finally generated enough empirical work to support an actual review.
If you have spent any time arguing with a vendor’s “40% faster with our copilot” claim and felt yourself reaching for the same three references, this is the document to replace that toolkit with.
What the Studies Actually Say
The headline finding running through Wilson’s review is that vendor benchmarks and field measurements disagree by a margin that should embarrass anyone still quoting the former. Peng (2023) found that GitHub Copilot produced a 55% task speedup, but on a constructed coding task that bears no resemblance to maintaining a five-year-old codebase with seventeen contributors. Becker (2025) measured AI assistance on real open-source maintenance work and the effect inverted: a 19% slowdown, not a speedup.
The senior developer finding is the one that should make engineering leaders stop. The same body of research that shows junior developers getting genuine acceleration also shows senior developers experiencing a 19% productivity decline. The mechanism is not mysterious. Seniors absorb the review burden for AI-generated code that juniors merge. The tool’s output becomes their input, and the input is lower quality than what the senior would have written themselves. We wrote about this dynamic in The Speed Trap of AI Coding; Wilson’s review now gives it a citation.
Liu (2026) measured the quality drag directly: more than 15% of AI-generated commits introduce quality issues, and roughly 25% of those issues persist long-term. That is not a transient cost. That is technical debt being shipped at a rate that exceeds normal code review’s catch rate, and it compounds.
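The arithmetic behind that claim is worth making explicit. What follows is a minimal sketch, not Liu's method: it treats the two reported rates as point estimates (the source says "more than 15%" and "roughly 25%"), assumes one issue per flagged commit, and uses a hypothetical commit volume.

```python
def persistent_issue_commits(ai_commits: int,
                             issue_rate: float = 0.15,
                             persist_rate: float = 0.25) -> float:
    """Expected AI-generated commits whose quality issue persists long-term.

    issue_rate and persist_rate are the approximate figures quoted above;
    ai_commits is whatever volume your own team ships (hypothetical here).
    """
    return ai_commits * issue_rate * persist_rate


# Hypothetical team shipping 1,000 AI-generated commits in a year.
print(persistent_issue_commits(1000))
```

Under those assumptions, roughly 37 or 38 of every 1,000 AI-generated commits carry an issue that is still there long after merge, and that count is a floor if the true issue rate is above 15%.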
He (2026) studied Cursor adoption specifically and found that velocity gains were transient while complexity increases were persistent. The team got faster for a quarter, then settled back to baseline velocity while carrying a permanently higher complexity load. This is the output-competence decoupling we have written about, measured longitudinally.
The enterprise studies tell the same story from the procurement side. Bakal (2025) reported a 33% acceptance rate for AI suggestions in production environments, with no correctness tracking attached. The organization buying the tool knows how often developers accept the suggestion. It does not know how often the accepted suggestion was right. Weisz (2025) at IBM measured uneven gains across users in a controlled study, with the variance large enough that aggregate “productivity lift” numbers became meaningless.
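The gap between acceptance and correctness is easy to see in a toy model. The sketch below is illustrative only: the `Suggestion` record and both functions are hypothetical names, not part of any vendor's telemetry, and the log is made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Suggestion:
    accepted: bool
    # None means nobody ever checked whether the accepted code was right.
    correct: Optional[bool] = None


def acceptance_rate(log: List[Suggestion]) -> float:
    """Share of suggestions the developer accepted."""
    return sum(s.accepted for s in log) / len(log)


def verified_correct_rate(log: List[Suggestion]) -> Optional[float]:
    """Share of accepted suggestions confirmed correct, or None if untracked."""
    checked = [s for s in log if s.accepted and s.correct is not None]
    if not checked:
        return None
    return sum(s.correct for s in checked) / len(checked)


# Hypothetical log: one of three suggestions accepted, correctness never recorded.
untracked = [Suggestion(True), Suggestion(False), Suggestion(False)]
print(acceptance_rate(untracked))        # about one third
print(verified_correct_rate(untracked))  # None: the data cannot answer the question
```

The second function returning `None` is the whole point: an acceptance rate can be computed from this log, while a correctness rate cannot until someone records the outcome.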
The security floor is the one that should make CISOs read this paper twice. Pearce (2022), confirmed by Dora (2025), tested five major LLMs against established web security standards. All five failed. Not “performed below expectations” failed. Failed. The implication is that any team measuring AI coding productivity without measuring AI coding security is computing a numerator while ignoring a denominator that may already exceed it.
Why This Is a Literature Review and Not Another Essay
The reason Wilson’s piece matters is not that it makes a new argument. The argument has been made. It matters because for the first time, you can hand someone the citations.
Every time we have written about the two-percent productivity gap, or about why measuring the team and not the model is the only honest move, or about the harness difference, we have been arguing in a vacuum where the other side cites SaaS marketing decks and our side cites three studies on repeat. Wilson catalogued the rest. Becker, Peng, Liu, He, Bakal, Weisz, Pearce, Dora. The names matter because the names are how the conversation moves from belief to citation.
If you sit in a meeting where someone says “our developers are 40% faster with this tool,” you can now ask three questions with academic backing:
- What is the task distribution? The 55% speedup Peng measured on a constructed task gives way to the 19% slowdown Becker measured on real maintenance work.
- What is the seniority distribution? The same productivity number averaged across juniors and seniors hides a decline at the senior end.
- What is the persistence horizon? He showed the velocity gain is transient and the complexity cost is permanent.
Three questions. Three citations. The vendor can no longer answer with another deck.
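The seniority question in particular comes down to a weighted average. Here is a minimal sketch with hypothetical numbers: the headcounts and the +40% junior figure are invented for illustration, and only the 19% senior decline comes from the research discussed above.

```python
from typing import List, Tuple


def blended_change(groups: List[Tuple[int, float]]) -> float:
    """Headcount-weighted average of per-group productivity change.

    groups: (headcount, fractional_change) pairs, e.g. (4, -0.19).
    """
    total = sum(n for n, _ in groups)
    return sum(n * change for n, change in groups) / total


# Hypothetical team: 8 juniors at +40%, 4 seniors at -19%.
team = [(8, 0.40), (4, -0.19)]
print(f"{blended_change(team):+.1%}")  # prints +20.3%
```

A slide that reports the blended figure looks like a clear win, even though a third of this hypothetical team, the third doing the reviewing, got slower.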
The Implementation Stays the Same
Wilson’s review does not change what a serious measurement program looks like. It just changes what the conversation around procurement and adoption sounds like. The implementation work we have published still stands.
If you want to measure your team rather than the model, The Software Centaur Era is still the framework. If you want to understand why output competence and verification have to be measured separately, the verification layer essay is still the breakdown. If you want to know why the same model produces different productivity numbers in different harnesses, the harness difference is still the explanation.
What changes is what you put in front of the people who do not read those essays. You put Wilson in front of them. You put Becker, Liu, He, Pearce in front of them. The skepticism essays were for practitioners. The literature review is for everyone else.
Do This Now
Block thirty minutes this week. Read Wilson’s piece end to end. Pull the three or four citations most relevant to the productivity claims you are currently being asked to evaluate or defend against. Add them to whatever shared document your engineering organization uses for vendor evaluation. The next time someone walks in with a “40% faster” deck, the document already has the counter-citations loaded.
Then take the harder step: audit your own internal productivity claims for AI coding tools. If you have told an executive “our team is X% more productive since adopting Y,” check that claim against Wilson’s twelve failure modes. The most common discovery is that the metric measured something other than what its name implied. That is the moment to fix the metric, not the moment to defend it.
The productivity debate just stopped being a vibes contest. The literature is assembled. The names have citations. The vendors no longer get the last word by default.
This analysis synthesizes Twelve Ways to Be Wrong About AI-Assisted Coding (Greg Wilson, May 2026).
Victorino Group helps engineering leaders replace vendor productivity claims with measurement that survives a peer review.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.