benn.substack Just Named What Releezy Ships: 'Wins Above Claude.'

A disclosure first. We sell baseline-relative measurement for AI work. That is exactly why this essay exists: an outside analyst just named the unit we ship, and we would rather quote him than ourselves.

On May 22, benn.substack published a piece called “WAC.” The acronym is borrowed from baseball, where “Wins Above Replacement” measures how many more wins a player produces than a generic minor-league call-up would. Benn proposes a software-buying analogue. Wins Above Claude. Value created above what an integrated Claude plus its MCPs already does out of the box, before any vendor’s wrapper, agent, or “AI feature” is layered on top.

The framing is procurement-friendly, mildly snarky, and structurally correct. It also closes a debate the industry has been avoiding.

The benchmark era is over because the baseline moved

Benchmarks compare models to fixed test sets. The test set is the constant. The model is the variable. That regime worked while frontier models advanced once or twice a year. It does not work now. Benn cites llm-stats.com: 62 AI models released in 126 days. The constant is no longer constant. Any benchmark score with a publish date older than six weeks is describing a different industry.

Worse, benchmarks evaluate models in test conditions. The actual buying decision is about deliverables produced inside a company’s tools, culture, codebase, and workflow. None of those are in the benchmark. A model that scores 78% on SWE-bench can be useless inside a specific monorepo with a specific build system and a specific code review culture. A model that scores 62% can be transformative there. The benchmark cannot tell you which.

WAC fixes the wrong end of the equation. Instead of fixing the test set and varying the model, you fix the deployment context, your company, your tools, your workflows, and you vary what is plugged in. The baseline becomes “the default Claude plus its standard MCPs, working on your real problems.” Any vendor pitching an agent, wrapper, or AI feature has to demonstrate value above that baseline. Not against a synthetic eval. Against the thing the buyer can already self-serve for $20 a seat.

Why this generalizes beyond Claude

The acronym is cute but the principle is portable. Substitute any sufficiently capable default. Wins Above ChatGPT Enterprise. Wins Above Gemini Workspace. Wins Above Copilot. The mechanic is the same: there is now a baseline assistant inside the workflow that already does a non-trivial portion of the job, and the only honest measurement is the marginal lift a paid vendor adds on top of it.

This is not a hypothetical. Ask any engineering leader what their developers actually use day to day. The answer involves Claude, ChatGPT, or Copilot more often than any procurement-approved AI tool. The baseline is already there. It is just not on the scoreboard.

That is the procurement consequence. Every “AI productivity” vendor pitch in 2026 is selling you a delta. Most of them are pretending the baseline is zero. Benn’s contribution is naming the lie out loud. The baseline is not zero. The baseline is whatever the default assistant already delivers inside your context, and you have to measure it before you can evaluate anyone’s claim of improvement.

The hiring analogue, which is the most useful part

Benn points at Linear’s hiring practice. Two to five day paid work trials instead of traditional interviews. The candidate does real work, in the real codebase, with the real team, and the team measures real output. Pass the trial, get hired. Fail the trial, get paid for the work and part ways respectfully.

The reason this matters for AI buying is that it solves the same problem benchmarks failed to solve. You cannot evaluate a candidate, human or AI, in a vacuum. The performance is contextual. It depends on the codebase, the tooling, the team norms, the existing review culture. Linear figured out that the only way to know if a senior engineer is actually senior in their context is to put them in their context and measure output. The same is true for an AI vendor. The only way to know if an AI agent produces value above the Claude baseline in your environment is to deploy it in your environment, alongside the baseline, and measure.

The implication: every meaningful AI procurement decision in the next 18 months will involve some version of a paid trial. Not a demo. Not a proof-of-concept slideshow. A real deployment, with real work assigned, measured against the baseline, over enough time to be statistically credible. Vendors who refuse this format are telling you their delta does not survive contact with reality.

What the buyer actually has to build

WAC as a phrase is doing real work. WAC as a measurement system is harder, and this is where most companies will discover the cost of having avoided the problem.

To measure Wins Above Claude, a buyer needs four things they probably do not have. First, a definition of what “winning” looks like for the work in question (shipped tickets, resolved cases, qualified leads, drafted contracts, the unit varies). Second, an instrumented baseline of the default-assistant version of that work over a credible time window (weeks, not hours). Third, a sample of the same work executed with the vendor’s tool in place, ideally split A/B or run sequentially under matched conditions. Fourth, an attribution model that survives the obvious confounders (operator skill differences, ticket difficulty mix, calendar effects).

That is not a benchmark. It is operational measurement infrastructure. Most companies do not run it for their human teams either, which is part of why the problem feels so foreign when applied to AI. Google just expanded its search box for the first time in 25 years to accommodate longer AI queries. The interface is changing because the underlying behavior changed. The measurement interface has to change too. WAC is the procurement-side version of that interface change.

Why we are claiming this now

The reason we wrote this post the same week benn published is that “Wins Above Claude” is the buyer-side name for what we have been arguing on the seller side for nine months. We have called it baseline-relative measurement, lift-over-default, agent-versus-floor. None of those landed. WAC will, because the AI buying community is already trained on benchmarks, and a benchmark replacement gets uptake faster than a brand new category.

We would rather operate inside benn’s vocabulary than ours. The job is the same. Measure the baseline before believing the pitch. Build the trial harness before signing the contract. Treat any vendor that cannot pass a Linear-style work trial in your context as a vendor who has not tested their own claims.

A caution. The danger of naming a category is that the category gets watered down. “WAC-compliant” will appear on vendor decks within a quarter, and most of those decks will be selling the wrong number. The defense is mechanical, not rhetorical. If a vendor cannot describe (a) what your baseline is, (b) how it was measured, (c) over what window, and (d) what delta they are claiming over it, with what confidence, the WAC label is decorative. Ask the four questions every time.

Do this now

Before you take your next AI vendor meeting, run a three-step exercise. Pick one workflow you are considering paying to improve. Measure how the default Claude or ChatGPT assistant performs on that workflow over the next two weeks, instrumented, with at least three operators. That is your baseline. Now make every vendor that walks in claim a specific delta above that number, with a proposed measurement window and confidence interval. The ones who can articulate this get a paid trial. The ones who cannot get a follow-up call after they figure it out.

The fastest way to make AI procurement honest is to stop letting the baseline be invisible. benn just gave the baseline a name. Use it.

This analysis synthesizes WAC (Wins Above Claude) (benn.substack, May 2026).

Victorino Group helps buyer and seller teams build the baseline-relative measurement that turns AI vendor claims into verifiable deltas. Let’s talk.