When You Certify an Agent, You Change What It Optimizes For
eBay quietly redesigned its Top-Rated Seller badge a few years ago. The old criteria leaned on consumer feedback, which often punished sellers for things outside their control: courier delays, misunderstood listings, weather. The new criteria leaned on administrative metrics: shipping timeliness based on tracking data, and how unresolved buyer claims were handled.
A team of marketing economists, Xiang Hui, Ginger Zhe Jin, and Meng Liu, treated the redesign as a natural experiment. Their 2025 paper in the Journal of Marketing Research documents the result. Sellers improved on the dimensions that determined the badge. Performance gains clustered just above the cutoff points that decided who qualified. And on the buyer side, fewer than one percent of shoppers ever opened the detailed seller ratings page. The badge in the search result did the work. The report behind it was wallpaper.
The paper is about humans selling sneakers and laptops. It is not about AI. But the mechanism it documents is the same mechanism we are now wiring into every certification regime for agents. The eBay study is the cleanest empirical evidence in years that certification does not measure behavior. It produces it.
Cutoff-Clustering Is the Real Finding
The headline from Hui, Jin, and Liu is easy to compress into a slogan: “sellers improved.” That misses what makes the paper interesting.
The improvement was not uniform. It concentrated above the cutoff. Sellers near the threshold did the most work. Sellers safely above did less. Sellers far below often did nothing. The shape of the response reveals the optimization. People were not getting better at selling. They were getting better at clearing the bar.
This is Goodhart’s Law in a peer-reviewed dataset. When a measure becomes a target, it ceases to be a good measure. The eBay redesign turned shipping timeliness and claim handling into targets, and the marketplace responded the way economic theory predicts: by hitting the targets, by the smallest margin necessary, while shifting attention away from anything not measured.
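The clustering logic can be sketched as a toy simulation. Nothing below comes from the paper; the cutoff, badge payoff, and cost numbers are illustrative assumptions. Each seller has a baseline quality score, the badge pays a fixed amount, and improvement costs effort per unit. The payoff-maximizing response is to clear the bar exactly, and only when it is cheap enough:

```python
import random

# Toy model of the cutoff response (illustrative numbers, not from the
# paper): sellers maximize badge payoff minus improvement cost.
CUTOFF = 0.80         # certification threshold
BADGE_VALUE = 0.15    # payoff for holding the badge
COST_PER_UNIT = 1.0   # cost of raising the score by one unit

def best_response(baseline):
    """Score a payoff-maximizing seller settles on."""
    if baseline >= CUTOFF:
        return baseline               # safely above: no extra effort
    gap = CUTOFF - baseline
    if gap * COST_PER_UNIT < BADGE_VALUE:
        return CUTOFF                 # near the bar: clear it exactly
    return baseline                   # far below: do nothing

random.seed(0)
baselines = [random.uniform(0.4, 1.0) for _ in range(10_000)]
after = [best_response(b) for b in baselines]

# Mass piles up exactly on the cutoff; nobody overshoots it.
at_cutoff = sum(1 for s in after if s == CUTOFF)
print(f"sellers sitting exactly on the cutoff: {at_cutoff} of {len(after)}")
```

The histogram of `after` is the shape the paper describes: unchanged far below the bar, a spike at the bar, unchanged above it.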
The study found the strongest effects in categories where the administrative metrics correlated well with what buyers actually wanted. That is the comforting half of the result. The uncomfortable half: in categories where the correlation was weaker, sellers still optimized for the metric. The badge does its work whether the metric is wise or not.
The Mechanism Ports to Agents
Now run the same logic on an agent.
A frontier lab publishes a benchmark suite. A vendor publishes a “Trust Score.” An enterprise builds an internal eval gate. Each is a cutoff. Each is a binary signal: pass or fail, certified or not, shipped or held. Models that clear the bar enter the catalog. Models that miss it are revised until they pass.
What does revision look like in practice? Reinforcement learning from human feedback. Eval-tuned post-training. Synthetic data generation aimed at the failure modes the gate flags. None of these are abstract. They are reward signals. And reward signals shape policy.
If the eBay sellers had been algorithms, we would describe their behavior as policy optimization against a categorical reward. They were rewarded for crossing a threshold and unrewarded for the marginal effort beyond it. Their policy adapted accordingly. We do exactly the same thing to agents, on purpose, every day.
The implication is awkward. The behaviors we measure are the behaviors we get. The behaviors we do not measure either atrophy or never develop. A model trained to clear an honesty eval will be honest in the eval distribution. A model trained to clear a safety gate will avoid the failure modes in the gate. Anything outside that envelope is not part of the optimization, and so it is not part of the product.
We have written about this from the architecture side. In The Architecture of Agent Trust, the argument was that reliable agent behavior comes from the environment, not the prompt. The eBay study is the marketplace-economics version of the same point. The seller’s environment, post-redesign, made certain behaviors high-reward and others irrelevant. The seller’s policy followed.
Nobody Reads the Report
The other finding deserves its own paragraph because it is the one most people skip.
Fewer than one percent of buyers viewed the detailed seller ratings on profile pages. The trust signal that moved the market was the badge in the search result. The underlying ratings, the actual evidence behind the certification, were almost never inspected.
This is the part that should make anyone running a vendor eval pause.
The model card sits on a page nobody opens. The eval report is published on a wiki nobody reads. The audit trail is in a folder nobody navigates to. What people see is the badge: “ISO 42001 certified,” “Trust Score 9.4,” “passed internal red team.” That is the signal that moves the procurement decision, that survives the meeting, that ends up in the security review checkbox.
In ISO 42001: When AI Governance Becomes a Product Feature, we made the case that external certification is becoming a vendor selection criterion. That is still true. The eBay study adds a sharper edge to the same point: the certification will matter more than the underlying evidence, because the underlying evidence will not be read. Design accordingly.
Eval Gates Are Reward Signals
Here is the design rule that falls out of all of this.
If your eval gate is binary, assume the model will be optimized to clear it by the smallest margin that will reliably pass. Not because the model “wants” to game the eval. Because the optimization process you wrap around the model rewards exactly that behavior. You built a cutoff. Cutoffs produce clustering.
If your eval gate measures one dimension well and ignores another, assume the ignored dimension will degrade. Not because anyone decided to neglect it. Because effort, training data, and capability budgets are finite, and they will flow toward what is rewarded.
If your eval report is detailed, assume almost no one will read it. The badge is the artifact that travels. The report is the artifact that justifies the badge to the few people who ask.
These are not pessimistic predictions. They are baseline assumptions. They are what a careful procurement team or a careful engineering team should bring into every conversation about AI certification, internal or external.
The practical consequence is to design eval gates the way you would design a market mechanism. Multiple metrics, not one. Continuous scores you publish, not just pass/fail badges you stamp. Random spot checks against dimensions not in the gate, so capabilities outside the measured set do not silently atrophy. Cutoffs that move when the population of certified agents starts to cluster against them.
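A gate built on those principles might look like the following sketch. Every dimension name, threshold, and band here is a hypothetical illustration, not a real certification scheme: continuous scores are published alongside the pass decision, one dimension outside the gate is randomly spot-checked, and the cutoff moves when certified scores start piling up against it.

```python
import random

# Hypothetical eval-gate sketch following the rules above. All names
# and numbers are illustrative assumptions.
GATED = ("honesty", "safety", "robustness")
SPOT_CHECKS = ("calibration", "tool_use")   # not part of the pass decision

def evaluate(scores, cutoff):
    """Return published continuous scores, the pass decision, and one spot check."""
    gated = {d: scores[d] for d in GATED}
    # Spot-check one dimension outside the gate so it cannot atrophy unnoticed.
    probe = random.choice(SPOT_CHECKS)
    return {
        "scores": gated,
        "passed": all(s >= cutoff for s in gated.values()),
        "spot_check": (probe, scores[probe]),
    }

def adjust_cutoff(cutoff, certified, band=0.02, step=0.01):
    """Raise the cutoff when certified scores cluster just above it."""
    near = sum(1 for s in certified if cutoff <= s < cutoff + band)
    if certified and near / len(certified) > 0.5:
        return cutoff + step
    return cutoff
```

Publishing the continuous scores removes the all-or-nothing cliff that drives clustering, and the logged spot check gives unmeasured dimensions somewhere to show up before they silently degrade.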
The eBay redesign worked, in the categories where it worked, because the administrative metrics were genuinely close to what buyers cared about. That is a high bar. Most AI eval gates today are not at that bar. They are convenient proxies that we built quickly, that we now use to make consequential decisions, and that we will train the next generation of models to satisfy.
The certification will produce the behavior it measures. That is the whole finding. Now the question is whether the behavior it measures is the behavior you actually want.
This analysis builds on Hui, Jin & Liu (2025), “Designing Quality Certificates: Insights from eBay,” Journal of Marketing Research, 62(1), 40-60 (doi:10.1177/00222437241270222), as summarized by the American Marketing Association (April 2026).
Victorino Group helps enterprises design eval gates that survive Goodhart’s Law. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.