Meta Killed Its Token Leaderboard. Usage Was Never Impact.
Meta’s internal AI leaderboard tracked 73.7 trillion tokens consumed in roughly 30 days. The company nicknamed the ranking Claudeonomics. Then the CTO wrote a memo to around 6,000 employees telling them the number measured nothing.
The token figure was reported by The Decoder. The memo, reported by The Information, landed as AI costs at Meta approached billions of dollars for 2026. Andrew Bosworth, the CTO, put the correction in one line: “All motion is not progress and token usage alone is not a measure of impact of any kind.”
The memo reads as an obituary for a metric its own author had helped popularize. Meta built the leaderboard, watched it produce record consumption, and then confirmed in writing that consumption told them nothing about whether any of that motion produced value. The biggest spender in the room was the first to say the scoreboard was broken.
The Law That Broke the Scoreboard
Charles Goodhart, a British economist, gave us the rule in 1975. Its popular wording, a later paraphrase usually credited to the anthropologist Marilyn Strathern, runs: when a measure becomes a target, it ceases to be a good measure. Point a reward at a proxy and people optimize the proxy, not the thing it was meant to stand for. The proxy inflates. The underlying value stays flat.
Token consumption is a textbook proxy. It is easy to count, it moves fast, and it feels like effort. So when companies started ranking engineers by tokens burned, the number did exactly what Goodhart predicted. It went up. A leaderboard rewards the behavior it measures, and the behavior it measures is spending, not shipping. You gamify usage, you get usage.
We wrote about the rise of this game in Tokenmaxxing: token budgets turning into status symbols, consumption leaderboards spreading across Meta, OpenAI, and Shopify. That piece traced the trend on the way up. This is the trend’s biggest practitioner pulling the plug. The correction came from inside the house, from the company whose leaderboard produced 73.7 trillion tokens in a month.
The problem was the instrument, not the invoice. Spending on AI can be the correct call. The metric Meta chose to justify that spend was structurally incapable of justifying anything. A number that only rises cannot tell you when to stop, when to redirect, or whether the last dollar bought anything at all.
The Spend Runs Ahead of the Evidence
Plenty of companies spend like Meta. Few admit the disconnect out loud.
Uber exhausted its entire 2026 AI coding budget in four months and had to cap engineers at $1,500 per month. Roughly 70% of Uber’s committed code is now AI-generated. And yet, as COO Andrew Macdonald told Fortune, the link between that spend and the output “is not there yet.” A company can push AI through the majority of its codebase and still not be able to draw a straight line from cost to value.
The blindness is nearly universal. Only 26% of companies have comprehensive visibility into their AI costs, according to KPMG. Most organizations cannot see the denominator (what they spend), let alone the numerator (what the spend returned). McKinsey’s 2025 State of AI survey found that 88% of organizations now use AI in at least one function, while only 39% report any EBIT impact from it. Adoption is close to saturation. Financial return trails far behind.
The human cost of chasing the proxy shows up too. Creative Boom’s State of the Creative Industry 2026 survey (882 respondents) found that 86% of creative professionals use AI, only 10% believe its overall effect on their industry is positive, and 69% report burnout. Usage and benefit are different things. When the tool is mandatory and the value is unproven, the number on the leaderboard rises while the people behind it wear down.
The Artifact That Replaces the Leaderboard
Killing a bad metric leaves a vacuum. Something has to answer “is this working?” The teams getting real value are replacing usage counts with evidence-gated outcome scoring, and the clearest published example comes from Ably.
Ably’s engineering team scrapped volume metrics and built a scorecard around two questions, each scored on an anchored 1 to 5 scale. First: what new outcomes did AI unlock that were not previously possible? Second: how deeply is AI embedded in how the team actually works? Engineering leads review the scores monthly.
The mechanism that makes it work is the gate. A score does not go up because someone felt more productive or because token usage climbed. It goes up only when there is a concrete example of a new outcome. No example, no increase. That single rule inverts Goodhart. The target is no longer a proxy you can inflate by spending more. The target is a specific, nameable result you either produced or did not.
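The gate described above can be sketched in a few lines of Python. This is a minimal illustration, not Ably's implementation: the `OutcomeQuestion` class, the `propose` method, and the example string are hypothetical names introduced here, and the only rules taken from the text are the 1 to 5 scale and "no example, no increase."

```python
from dataclasses import dataclass, field


@dataclass
class OutcomeQuestion:
    """One scorecard question scored on an anchored 1-5 scale (hypothetical model)."""
    text: str
    score: int = 1
    evidence: list[str] = field(default_factory=list)

    def propose(self, new_score: int, example: str = "") -> bool:
        """Apply a review decision. Returns True only if the score changed."""
        if not 1 <= new_score <= 5:
            raise ValueError("score must be on the 1-5 scale")
        example = example.strip()
        # The gate: an increase is rejected unless a concrete example is named.
        if new_score > self.score and not example:
            return False
        if example:
            self.evidence.append(example)
        changed = new_score != self.score
        self.score = new_score
        return changed


question = OutcomeQuestion(
    "What new outcomes did AI unlock that were not previously possible?",
    score=2,
)
question.propose(3)  # rejected: no example, score stays at 2
question.propose(3, "Hypothetical: shipped a migration the team had shelved")  # accepted
```

Token counts and self-reported productivity never appear as inputs, so there is nothing to inflate by spending more. Whether a given example is strong enough to count remains a human judgment made in the monthly review.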
Notice what the scorecard refuses to measure. It ignores tokens, commits, and hours-saved figures logged in a spreadsheet nobody validated. It measures whether the work moved, and it demands proof before it credits the movement. An anchored scale plus an evidence gate plus a human review cadence is a far cheaper instrument than a real-time token dashboard, and it answers the only question that matters.
This is the same discipline we argued for in the pinhole view of AI value: a single narrow metric gives you a confident, precise, wrong picture. And it complements the cost-side discipline from the end of the flat-fee era. Cost governance tells you what you are spending. Outcome scoring tells you whether the spend bought anything. You need both instruments, pointed at both sides of the ledger.
Do This Now
Find your usage metric and audit it against Goodhart. If your team reports token consumption, AI-tool adoption rates, or “percentage of code AI-generated” as a sign of progress, you are measuring the proxy. Ask one question of each metric: can this number go up without any new value being created? If the answer is yes, it is a leaderboard, and it will inflate.
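As a rough sketch, the audit can be written down as a one-question filter. Everything here is an assumption for illustration: the `Metric` class and `audit` function are hypothetical, the last metric in the list is an invented contrast case, and the True/False answers are judgments a person supplies, not something the code can work out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    # Human answer to: can this number go up without any new value being created?
    can_rise_without_new_value: bool


def audit(metrics):
    """Split metrics into inflatable proxies and outcome candidates."""
    proxies = [m.name for m in metrics if m.can_rise_without_new_value]
    outcomes = [m.name for m in metrics if not m.can_rise_without_new_value]
    return proxies, outcomes


metrics = [
    Metric("token consumption", True),
    Metric("AI-tool adoption rate", True),
    Metric("percentage of code AI-generated", True),
    Metric("named new outcomes shipped", False),  # hypothetical contrast case
]

proxies, outcomes = audit(metrics)
print("Leaderboards (will inflate):", proxies)
print("Outcome candidates:", outcomes)
```

The code does nothing clever. It just forces the question to be asked of every metric on the dashboard, one at a time, with the answer written down.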
Then build the replacement before you delete the old one. Pick two outcome questions your leadership actually cares about. Score them on an anchored scale. Gate every score increase behind a concrete, named example of a result that did not exist before. Review monthly with a human in the room. It is one meeting and one shared document. It costs less than the dashboard you are retiring, and unlike the dashboard, it can tell you when to stop.
Meta had the largest token leaderboard in the industry and the memo that ended it. The instrument that replaces it is not more expensive telemetry. It is a scorecard cheap enough to run in a spreadsheet and honest enough to say no.
This analysis synthesizes reporting aggregated by MLQ.ai (June 2026), Ably’s engineering-effectiveness scorecard (Ably, June 2026), and The State of the Creative Industry 2026 (Creative Boom, June 2026). Memo details and the Bosworth quote were reported by The Information, the token figure by The Decoder, and Uber’s budget figures by Fortune.
Victorino Group helps organizations replace AI usage metrics with evidence-gated outcome scoring that survives Goodhart’s law. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.