16,000 Simulated Shopping Rounds: Marketing's First Real Data on AI Agents

For two years the marketing-to-agents conversation lived on hypothesis. Vendors sold schema, AEO playbooks promised citation lift, and the persuasion stack kept running as if the buyer were still a human. The skeptics, ourselves included, argued the foundations had shifted. We had vibes. We did not have controlled experiments.

This week we got two.

Researchers at Bayes Business School and King’s Business School, writing in Harvard Business Review on May 12, ran over 16,000 simulated choice situations across four AI shopping agents and eight promotional mechanisms in four product categories. The headline: only ratings consistently pushed AI choices upward. Strike-through pricing, scarcity, countdown timers, social proof, the entire e-commerce persuasion stack, produced unstable, model-dependent effects. Some reasoning models actively penalized aggressive cues as manipulation signals.

A day earlier, Ahrefs published a difference-in-differences study on 1,885 pages that added JSON-LD schema and tracked citation behavior across AI Mode, ChatGPT, and AI Overviews. Schema produced a 2.4% lift in AI Mode citations, statistically indistinguishable from zero. In AI Overviews, schema pages lost an average of 12 daily citations.

Two studies, one message. The hypothesis era is over. The most-recommended AEO tactic showed no measurable citation lift, and the persuasion playbook that built modern e-commerce does not transfer cleanly to the agent.

What the HBR study actually measured

The setup matters because it is the first piece of marketing research we have seen that treats the AI shopper the way pharmacology treats a drug. Sabbah and Acar tested GPT-4.1-mini, GPT-5, Gemini 2.5 Pro, and Gemini 2.5 Flash Lite. They held the product set constant, varied one promotional cue at a time, and measured whether the agent’s choice changed. Four product categories (headphones, vacuum cleaners, vitamins, running shoes), eight mechanisms (strike-through pricing, percent discounts, scarcity, countdown timers, ratings, social proof, badges, free shipping), 16,000+ rounds.
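To make the one-cue-at-a-time design concrete, here is a minimal Python sketch of how such a grid could be enumerated and scored. The model, category, and mechanism names come from the study as described above. The function names, the on/off comparison, and the selection-rate metric are our own illustrative assumptions, not the authors' published method.

```python
from itertools import product

MODELS = ["GPT-4.1-mini", "GPT-5", "Gemini 2.5 Pro", "Gemini 2.5 Flash Lite"]
CATEGORIES = ["headphones", "vacuum cleaners", "vitamins", "running shoes"]
MECHANISMS = [
    "strike-through pricing", "percent discount", "scarcity", "countdown timer",
    "ratings", "social proof", "badge", "free shipping",
]


def build_conditions():
    """One cell per (model, category, cue): the product set stays fixed
    and exactly one promotional cue is varied."""
    return [
        {"model": model, "category": category, "cue": cue}
        for model, category, cue in product(MODELS, CATEGORIES, MECHANISMS)
    ]


def selection_lift(choices_with_cue, choices_without_cue):
    """Selection rate with the cue minus selection rate without it.

    Each argument is a list of booleans, True when the agent picked the
    target product in that round. (Hypothetical metric for illustration.)
    """
    rate_with = sum(choices_with_cue) / len(choices_with_cue)
    rate_without = sum(choices_without_cue) / len(choices_without_cue)
    return rate_with - rate_without


conditions = build_conditions()
print(len(conditions))  # 4 models x 4 categories x 8 mechanisms = 128 cells
```

A positive lift in a cell means the cue pushed that model toward the product; a negative lift means the model penalized it. How the 16,000+ rounds were spread across these cells is not something this sketch assumes.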

The findings, in order of practical weight.

Ratings worked everywhere. Across all four models and all four categories, higher star ratings increased the probability of selection. This is the only universal effect in the study. It is also the one signal that maps to something verifiable on the product page itself, independent of how the merchant decorates the listing.

Discount and urgency cues worked unevenly. Strike-through pricing improved selection for GPT-4.1-mini and Gemini 2.5 Flash Lite. Gemini 2.5 Pro’s response weakened as the discount got more extreme. GPT-5 showed signs of penalizing scarcity in certain categories. Cues designed to push human buyers in one direction through thirty years of e-commerce produced divergent, sometimes opposite, reactions across four agents shopping the same shelf.

Reasoning models penalized persuasion. This is the line that should hang on every CMO’s wall. The advanced models behaved as if intensity itself were a signal of low quality. “Only 2 left” did not create urgency. It created suspicion. The agent appears to read the manipulation, infer that the seller is the kind of seller who manipulates, and downweight accordingly.

The framing the authors use is correct. Persuasion architecture optimized for human cognitive biases (loss aversion, anchoring, scarcity heuristics) does not transfer to a system that has read the entire literature on those biases and was trained, at least partly, to resist them.

What the Ahrefs study killed

Patrick Stox and team ran a difference-in-differences analysis on 1,885 pages that added JSON-LD schema between October 2025 and February 2026. They compared citation behavior on those pages against a matched control group across three AI surfaces.

The numbers, plainly:

AI Mode: +2.4% citation lift, not statistically significant.
ChatGPT: +2.2% citation lift, also not statistically significant.
AI Overviews: -4.6% change in citations, an average loss of roughly 12 daily citations per page after schema was added.

If schema worked the way two years of AEO recommendations claimed it worked, this study would have shown a double-digit lift on at least one surface. It showed a coin flip on two and a measurable penalty on the third. The AI Overviews result, which contradicts received wisdom most aggressively, deserves the caveat that DiD on observational data cannot prove causation. Adding schema may not itself cause the loss. It may simply correlate with sites that are restructuring in ways AI Overviews already disliked. Either way, the upside that practitioners were told to expect is not in the data.
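For readers who have not worked with difference-in-differences, the core arithmetic is small enough to sketch in Python. The function name and every number below are hypothetical and chosen only to show the mechanics. They are not figures from the Ahrefs dataset, and the real study will have used a fuller statistical model than a comparison of four means.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group means.

    Change in the treated group (pages that added schema) minus change in
    the control group (matched pages that did not) over the same window.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical daily citation counts, NOT data from the study.
treated_pre = [100, 110, 90]    # schema pages, before adding schema
treated_post = [105, 115, 95]   # schema pages, after adding schema
control_pre = [80, 85, 75]      # control pages, same "before" window
control_post = [90, 95, 85]     # control pages, same "after" window

print(did_estimate(treated_pre, treated_post, control_pre, control_post))  # -5
```

In this toy case the schema pages gained 5 citations a day, but the control pages gained 10, so the estimate is -5. That is how a page can grow in absolute terms and still register as a relative loss. The estimate reads as an effect of schema only if both groups would have followed the same trend without it, which is exactly the assumption observational data cannot guarantee.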

The right reading is not that schema is harmful. The right reading is that decorative schema is noise. If your product schema improves the human-facing presentation, ship it. If you are adding JSON-LD because a tool told you it boosts AI citations, the tool was wrong.

Reading the two studies together

If you read HBR alone, the lesson sounds like “redesign your persuasion stack for agents.” If you read Ahrefs alone, the lesson sounds like “stop optimizing markup.” Read together, the deeper pattern is structural.

Persuasion stops working when the buyer can read the manipulation. AEO stops working when the answer engine optimizes for substrate it already trusts (publishers, reviews, transactional databases) instead of decorations that authors control.

Both studies point to the same fundamentals. Verifiable product signals. Authentic third-party reviews. Price honesty. Substrate the agent can audit against the rest of the web. The plays that survive are the ones that were true before the agent showed up. The plays that fail are the ones that were marketing dressed as information.

Brainlabs’ Organic Media Mix framework, published the same week, gives this the operational frame a CMO can actually deploy. Their case data on a single brand showed 23% of ChatGPT citations coming from Reddit versus 3% of AI Overviews citations from the same source. Channels matter per platform, and the channel mix that wins on one engine can be irrelevant on another. The OMM frame stops treating “AI visibility” as a monolith. It treats each surface as its own publication problem with its own substrate.

That is consistent with what we argued in our piece on the asymmetric marketing governance stack: the layer marketers actually control is narrower than vendors suggest. HBR and Ahrefs just provided the controlled data for that argument.
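The per-surface view behind the Brainlabs numbers can be approximated with a simple tally. The sketch below assumes you already have a log of observed citations as (surface, source) pairs. The function name, the log format, and the sample counts are hypothetical and are not drawn from the Brainlabs case data.

```python
from collections import Counter, defaultdict


def source_share_by_surface(citations):
    """Return {surface: {source: share}} from (surface, source) pairs.

    Shares are fractions of that surface's citations, so each surface's
    shares sum to 1 and surfaces can be compared side by side.
    """
    counts = defaultdict(Counter)
    for surface, source in citations:
        counts[surface][source] += 1
    return {
        surface: {src: n / sum(by_source.values()) for src, n in by_source.items()}
        for surface, by_source in counts.items()
    }


# Hypothetical citation log for one brand, NOT real measurements.
log = (
    [("ChatGPT", "reddit")] * 3
    + [("ChatGPT", "publisher")] * 7
    + [("AI Overviews", "reddit")] * 1
    + [("AI Overviews", "publisher")] * 9
)

shares = source_share_by_surface(log)
for surface, by_source in shares.items():
    print(surface, by_source)
```

Keeping the shares per surface, instead of pooling everything into one "AI visibility" figure, is the whole point: the same source can matter a great deal on one engine and barely register on another.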

The agent is not a search engine that learned to talk

The biggest unspoken assumption in two years of AEO writing was that AI search is search-plus-summary. Optimize the index, get cited. The HBR data falsifies that for shopping. The Ahrefs data falsifies it for citations on at least one surface.

The agent is closer to a junior analyst with a calculator and a reading list than to a search engine. It penalizes obvious selling. It weights signals it can cross-check. It treats decorative content as decoration. Reasoning models will get better at this, not worse. The intensity penalty Sabbah and Acar found is not a bug being patched. It is a feature being trained.

That has implications past the product detail page. If the agent reads marketing intensity as a quality signal in reverse, then the high-pressure surfaces marketers love (popups, countdowns, social proof badges, “people are viewing this”) become liabilities at the moment an agent is reading. The same surface can drive a human conversion at 9:00 am and depress an agent recommendation at 9:05 am. Right now agent traffic share is small, but it grows monthly, and the cost of operating two opposite incentives on one page is not zero.

What we already knew that this confirms

We argued in the shopping verification collapse piece that the agent would treat product claims the way an auditor treats vendor PR. HBR’s reasoning-model penalty is the empirical version of that argument. We argued in the AEO agent-readable surface piece that marketing now writes for two readers, one of whom counts tokens. Ahrefs’ schema study tells us the second reader does not pay extra for decorative tokens. We argued in the hard-signals piece that fundamentals would beat tactics. Two months later the controlled data arrived.

This is the value of a content arc that compounds. We were not waiting for these studies to know what to recommend. We were waiting for the studies to make the recommendation unarguable.

What to do this week

Five moves, in order of return on time.

First, audit your product detail page for persuasion intensity. Count the scarcity cues, countdown timers, “selling fast” badges, and recovered-cart popups firing on the page. Each one is now a two-sided trade. If your agent traffic is above 5% of sessions and growing, the trade is already negative on some segments. The fix is not removal everywhere. It is making intensity a configurable layer that downgrades when the user-agent or behavioral signal looks like an agent session.

Second, invest in ratings substrate. The HBR study confirms what review-platform vendors have been saying for years, but with one new condition: the agent verifies. Inflated ratings, planted reviews, and review-gating patterns that an agent can detect from cross-checking review platforms and merchant policies will be discounted the way the agent discounted scarcity. Ratings are durable only if they are real.

Third, stop spending net-new budget on decorative schema. If you have a schema implementation that helps human-facing presentation (recipes, events, products with structured price/availability that humans see in SERPs), keep it. Do not commission a new schema project on the promise of AI citation lift. The Ahrefs DiD is the cleanest test we have. The answer is no.

Fourth, build a per-surface citation map the way Brainlabs’ OMM frames it. Stop treating “AI visibility” as one number. Reddit drives ChatGPT, news drives AI Mode, transactional databases drive shopping agents. Each surface gets its own substrate plan.

Fifth, write your governance policy for the persuasion layer the way you wrote it for the consent layer. Which surfaces are allowed to fire when an agent is the likely reader. Who approves a campaign that includes pressure tactics. What gets logged. Marketing governance is not just consent and data; it is also the integrity of the buying signal. The HBR study just made that argument unavoidable.
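The first and fifth moves meet in code: a configurable intensity layer that also records what it decided. The Python sketch below is one hedged way to start. The user-agent substrings are a small, non-exhaustive sample of publicly documented AI crawler and agent tokens, and the cue names, the `CueDecision` record, and both functions are hypothetical. User-agent strings can be spoofed, and many agent sessions arrive through ordinary browsers, so treat this as a first filter; the behavioral signals mentioned above are not modeled here.

```python
from dataclasses import dataclass

# Illustrative, non-exhaustive substrings seen in AI crawler/agent user-agents.
AGENT_UA_TOKENS = ("gptbot", "chatgpt-user", "oai-searchbot", "claudebot", "perplexitybot")

# Hypothetical cue names for the pressure tactics a page can fire.
PRESSURE_CUES = {"scarcity", "countdown_timer", "selling_fast_badge", "cart_popup"}


@dataclass
class CueDecision:
    """Loggable record of what was shown and what was held back."""
    likely_agent: bool
    cues_rendered: list
    cues_suppressed: list


def looks_like_agent(user_agent: str) -> bool:
    """Case-insensitive substring match against the known-token list."""
    ua = user_agent.lower()
    return any(token in ua for token in AGENT_UA_TOKENS)


def decide_cues(user_agent: str, requested_cues: list) -> CueDecision:
    """Suppress pressure cues for likely agent sessions; keep everything else."""
    likely_agent = looks_like_agent(user_agent)
    rendered, suppressed = [], []
    for cue in requested_cues:
        if likely_agent and cue in PRESSURE_CUES:
            suppressed.append(cue)
        else:
            rendered.append(cue)
    return CueDecision(likely_agent, rendered, suppressed)


decision = decide_cues(
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    ["ratings", "scarcity", "countdown_timer"],
)
print(decision)
```

Returning a record instead of a bare list is deliberate. The governance question of what gets logged is answered by persisting each `CueDecision`, which gives the approval process something concrete to review.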

The persuasion stack that built modern e-commerce was a thirty-year accumulation of behavioral economics, optimized for human cognition. It was never going to transfer cleanly to a reader that has read the textbooks. The good news in the data is that the durable plays remain durable. Real reviews. Honest pricing. Substrate the agent can verify. The bad news is that the decorative plays, including the ones marketers paid the most for in 2024 and 2025, are now measurable as decoration.

The hypothesis era is over. The data is in. Plan accordingly.

This analysis synthesizes “Research: Traditional Marketing Doesn’t Work on AI Shopping Agents” (HBR, May 2026), “We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved” (Ahrefs, May 2026), and “The Organic Media Mix” (Brainlabs, May 2026).

Victorino Group helps marketing teams build agent-aware measurement and governance.