Three Companies, One Failure Mode: Goodhart's Law Comes for AI Adoption

Thiago Victorino

Three confessions arrived in the same week. Meta quietly shut down its internal AI-token leaderboard, the same board that, two months ago, was the showcase example of how big tech rewards aggressive AI usage. Amazon employees told reporters they are “tokenmaxxing” under pressure from a tool called MeshClaw and from internal usage rankings. Luis von Ahn, the CEO of Duolingo, sat down with Fast Company and publicly walked back the “AI-first” all-hands memo that went viral last year, admitting the AI still produces “plenty of slop” and that the framing was wrong even if the practice was not.

Three companies. Three different forcing functions. Same failure mode.

When AI usage becomes the metric, employees optimize the metric. Quality goes sideways or down. The leaderboard keeps climbing.

This is Goodhart’s Law colliding with AI adoption. Engineer’s Codex surfaced the framing on May 7, and it is the right lens. In its popular phrasing, the law reads: “When a measure becomes a target, it ceases to be a good measure.” It is the cobra effect dressed up in tokens.

We named the operational pattern in The AI Workforce Inflection back in March. What changed in May is the public posture. Companies that were proud of their leaderboards are now dismantling them, and the CEO who weaponized adoption in a memo is now publicly editing his framing. The pattern is no longer a discovery. It is a confession.

The Meta Reversal

Engineer’s Codex reported the dismantling first, and the detail that matters is not the shutdown itself. It is the reason. Meta did not kill the leaderboard because someone wrote a thoughtful internal memo about Goodhart’s Law. They killed it because, per the report, engineers were burning millions of tokens for “literally zero productivity.” The leaderboard was working exactly as designed. That was the problem.

Read the sequence carefully. First, the company built a measurement system that turned token consumption into a visible status competition. Then, predictably, employees competed. They wrote longer prompts. They ran more agents. They left context windows open. They padded usage in the same way that engineers used to pad lines of code under “lines committed” tracking in the nineties. None of this required malice. It required only that humans respond to incentives, which they always do.

The leaderboard did not measure productivity. It measured the visible signature of token consumption, which is correlated with productivity in some cases and uncorrelated in many others. Once the signature became the target, the correlation broke.

Goodhart published the original formulation in 1975 in the context of monetary policy. The cobra effect, the colonial-era story of a Delhi bounty on dead cobras that resulted in citizens breeding cobras to collect the bounty, is the same idea told with snakes. Token leaderboards are the same story told with GPUs.

The Amazon Disclosure

The Amazon report, per Ars Technica’s coverage, describes employees “tokenmaxxing” under pressure from internal tooling. The article body sat behind a bot challenge and we could not fetch it directly, so the synthesis here leans on the published summary and the corroborating Engineer’s Codex framing. Two signals are clear from the report.

First, Amazon reportedly maintains its own internal usage leaderboards, similar in design to the Meta system that Meta just abandoned. Second, the report names “MeshClaw” as a tool that pushes employees toward AI usage in their daily workflow, and the phrase “performative usage” appears in the discourse around it. Performative is the operative word. Employees are reportedly running AI workflows because the workflow itself signals compliance with adoption metrics, not because the workflow produced a better outcome than the alternative.

Treat the MeshClaw specifics with care, because we did not read the original. Treat the pattern as confirmed, because it now appears at three different companies in publicly documented form.

What Amazon’s situation adds to the Meta story is duration. Meta noticed the gaming behavior and pulled the leaderboard. Amazon, per the report, is still running the system. The visible cost is mounting. The eventual unwind, if and when it happens, will be more expensive than Meta’s, because the behaviors it incentivized have had more time to calcify into team norms.

The Duolingo Walk-Back

The Duolingo story is structurally different, and that is exactly why it matters.

Luis von Ahn sent an all-hands memo last year declaring Duolingo “AI-first.” The memo was leaked, mocked, and circulated as the canonical example of an executive forcing AI adoption from the top. On Fast Company’s Rapid Response podcast on May 13, von Ahn publicly walked the framing back.

His admission has three parts. The framing was wrong. The AI still produces “plenty of slop.” And, in the part that surprised most observers, Duolingo never did the layoffs the memo seemed to imply and actually increased headcount in the period after the memo went out.

This is the cleanest possible refutation of “adoption mandate as governance.” A CEO who issued the most famous AI-adoption ultimatum of the cycle is now saying, on the record, that the ultimatum was bad framing. He is not retracting the use of AI. He is retracting the framing of AI usage as an end in itself.

The pattern under all three companies is the same. Inputs got named as outcomes. Employees optimized the inputs. Quality lagged behind. The companies are now, in different ways, undoing the conflation.

Why This Cluster Is Different

In March we documented tokenmaxxing as an emerging signal. In May the cluster moved from observation to confession. The shift is small but operationally significant.

A signal can be ignored. A confession from three different companies cannot. If the Chief AI Officer at a Fortune 500 company sees Meta dismantle a leaderboard, Amazon get reported on for the same behavior, and a CEO walk back the most viral AI-adoption memo of the year, that officer is now exposed if they keep running an input-as-outcome metric inside their own organization. The shareholders, the board, the regulators, and the staff are watching the same news cycle.

The risk is no longer that you might be making the mistake. The risk is that you continue to make the mistake after three peer companies have publicly admitted that it does not work.

The Real Lesson Is Not “Less AI”

The temptation in this kind of cycle is to read the news and conclude that AI adoption was a mistake. That reading is wrong. Duolingo’s headcount went up. AI usage at Meta and Amazon is not going to zero. The technology continues to ship value when deployed against real work.

The lesson is narrower and harder. Never let input become the outcome.

Token consumption is an input. Tool invocations are an input. Time spent in a coding agent is an input. None of these is a result. A result is a shipped feature that customers use, a closed deal, a resolved support case, a faster cycle time on a recurring workflow. The discipline is to keep the input metrics for internal capacity planning and the outcome metrics for performance and reward.

When the two collapse into one number, you have built a cobra farm.

Do This Now

If your organization has any kind of AI usage leaderboard, dashboard, or compensation tie to AI tool consumption, run this check this week. Pull the top ten users by token volume. Pull the bottom ten by token volume in the same role family. Look at their actual output over the last quarter. Not their input. Their output. Shipped artifacts, closed work, customer-facing outcomes.
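The check above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `Employee` record, the `shipped` outcome count, the use of a Spearman rank correlation, and the `weak=0.3` cutoff are all assumptions made for the example. Substitute whatever outcome measure your organization actually trusts.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Employee:
    name: str
    tokens: int   # input: token volume for the quarter
    shipped: int  # outcome: hypothetical count of shipped artifacts


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def leaderboard_check(staff, n=10, weak=0.3):
    """Compare the top-n and bottom-n token users (same role family) on output."""
    if len(staff) < 2 * n:
        raise ValueError("need at least 2 * n people so the groups do not overlap")
    by_tokens = sorted(staff, key=lambda e: e.tokens, reverse=True)
    top, bottom = by_tokens[:n], by_tokens[-n:]
    sample = top + bottom
    rho = spearman([e.tokens for e in sample], [e.shipped for e in sample])
    if rho < 0:
        verdict = "inverted"
    elif rho < weak:  # the 0.3 default is an illustrative cutoff, not a standard
        verdict = "weak"
    else:
        verdict = "strong"
    return {
        "rho": rho,
        "top_mean_output": mean(e.shipped for e in top),
        "bottom_mean_output": mean(e.shipped for e in bottom),
        "verdict": verdict,
    }
```

Calling `leaderboard_check(staff)` on one role family returns the rank correlation plus the mean output of each group. With only twenty people in the sample, treat the number as a prompt for a closer look at the actual work, not as a statistical finding.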

If the correlation is weak or inverted, you have your Meta moment. Shut the leaderboard down before it shows up in a Fast Company interview two quarters from now.

If the correlation is strong, you are one of the rare cases where the input proxy actually tracks the outcome. Keep the dashboard, but do two things. Publish the underlying outcome metric alongside the input metric, so that the input cannot drift away from the outcome unnoticed. And put the leaderboard on a sunset clock. Goodhart’s Law does not care that your dashboard worked last quarter. Once the metric becomes the target, decay starts on the next reporting cycle.
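One way to make those two safeguards hard to skip is to encode them in the dashboard's own configuration. The sketch below is a hypothetical illustration: the `DashboardPolicy` name, its fields, and the 180-day default are assumptions, not a description of any company's tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DashboardPolicy:
    input_metric: str             # e.g. "tokens"
    outcome_metric: str           # e.g. "shipped_features"
    launched: date
    sunset_after_days: int = 180  # assumed default; pick your own review cycle

    def sunset_date(self) -> date:
        return self.launched + timedelta(days=self.sunset_after_days)

    def is_expired(self, today: date) -> bool:
        return today >= self.sunset_date()


def render_row(policy, name, input_value, outcome_value, today):
    """Render one leaderboard row, always pairing the input with the outcome."""
    if policy.is_expired(today):
        # Past the sunset date the board refuses to render until someone
        # re-runs the correlation check and deliberately renews the policy.
        raise RuntimeError("leaderboard past its sunset date; re-validate first")
    return (
        f"{name}: {policy.input_metric}={input_value}, "
        f"{policy.outcome_metric}={outcome_value}"
    )
```

Because `render_row` requires an outcome value for every row, the input number can never be published on its own, and the expiry check turns the sunset clock from a good intention into a hard stop.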

The cluster this week is not a story about AI. It is a story about measurement. Measurement is governance. Three companies just told the same governance story in public.

The discipline is to learn it from their confession instead of from your own.


This analysis synthesizes Tokenmaxxing, Promomaxxing, and Misaligned Incentives in Tech (Engineer’s Codex, May 2026), Amazon Employees Are Tokenmaxxing Due to Pressure to Use AI Tools (Ars Technica, May 2026), and Duolingo’s CEO Admits Where He Got AI Wrong (Fast Company, May 2026).

Victorino Group helps teams design AI-adoption metrics that measure outcomes, not inputs. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
