Statistical Power Is Marketing's Hidden AI Governance Failure

Thiago Victorino

Every marketing slide deck this quarter has a line like “our AI campaign lifted conversion 44% at 99% confidence.” Most of those slides are wrong, and the reason is a number nobody reports.

Ronny Kohavi, former Microsoft Distinguished Engineer and Airbnb VP, co-author of Trustworthy Online Controlled Experiments (Cambridge, 2020), posted a worked example on LinkedIn this month. Per Kohavi’s analysis, a 44% lift reported at 99% confidence, computed from an experiment with 6.9% statistical power, has an 87% chance of being a false positive. Eighty-seven percent. Not the inverse of 99%. Not 1%. Eighty-seven.

The “99% confidence” was honest in the way most marketing reporting is honest. It was also functionally meaningless, because confidence alone cannot tell you whether a result is real. Without adequate power, a high confidence number says very little.

This is the missing layer in the AI marketing stack. We have spent a year arguing about consent architecture, citation visibility, and agent governance. Underneath all of that sits experiment governance, and right now most marketing organizations do not have one.

Confidence is not what people think it is

Confidence answers a narrow question. If there is no real effect, how often would I see a result this extreme by accident? Ninety-nine percent confidence means that, with no real effect, noise alone would produce a result this extreme about 1% of the time. It does not mean there is a 99% chance the lift is real. The number is useful, but only on the condition that the experiment was set up to detect a real effect in the first place.

Statistical power is the second condition. Power answers: if a real effect of size X exists, how often would my experiment actually catch it? The industry standard for controlled experiments is 80% power. Kohavi’s example sits at 6.9%.

Here is what 6.9% power means in practice. The experiment was set up so that even if a genuine 5% lift existed in the real world, the test would only detect it 7% of the time. That is not a measurement instrument. It is a coin landing slightly off-center. In a test this noisy, the only results large enough to cross the significance threshold are the exaggerated ones. When such an experiment then reports a 44% lift, the most likely explanation is not “AI works that well.” The most likely explanation is “I caught a tail of the noise distribution and froze it as truth.”
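To make the definition of power concrete, here is a minimal sketch using the normal approximation for a two-sided, two-proportion test. The function name, the 5% baseline conversion rate, and the sample sizes are illustrative assumptions, not figures from Kohavi's example.

```python
from statistics import NormalDist


def power_two_sided(baseline_rate: float, relative_lift: float,
                    n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two conversion rates."""
    norm = NormalDist()
    p_control = baseline_rate
    p_treatment = baseline_rate * (1 + relative_lift)
    # Standard error of the difference between the two observed rates.
    std_err = (p_control * (1 - p_control) / n_per_arm
               + p_treatment * (1 - p_treatment) / n_per_arm) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z_effect = (p_treatment - p_control) / std_err
    # Probability the test statistic lands beyond either critical value.
    return (1 - norm.cdf(z_crit - z_effect)) + norm.cdf(-z_crit - z_effect)


# Hypothetical scenario: 5% baseline, true 5% relative lift, growing traffic.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} users per arm -> power {power_two_sided(0.05, 0.05, n):.3f}")
```

With a true lift of zero the function returns alpha itself, which is a useful sanity check: an experiment whose power sits barely above alpha can hardly distinguish a real effect from none.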

Combine the two: low power plus a flashy effect plus high confidence equals an 87% false-positive rate, per Kohavi’s worked numbers. The frame to hold is “high confidence in a noisy instrument is not high confidence in the result.”
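The arithmetic behind a false-positive figure like this is Bayes' rule applied to three inputs: the significance threshold, the power, and how often tested ideas actually work. The sketch below is one common formulation. The source does not state the inputs behind Kohavi's number, so the 0.05 threshold and the 10% prior success rate are assumptions chosen for illustration; other conventions (for example, counting only one tail of the threshold) give somewhat different results.

```python
def false_positive_risk(alpha: float, power: float, prior_true: float) -> float:
    """P(no real effect | result was significant), via Bayes' rule."""
    false_alarms = alpha * (1.0 - prior_true)  # null ideas that cross the threshold
    true_hits = power * prior_true             # real effects the test detects
    return false_alarms / (false_alarms + true_hits)


# Assumed inputs: alpha = 0.05, 10% of tested ideas truly work.
underpowered = false_positive_risk(alpha=0.05, power=0.069, prior_true=0.10)
well_powered = false_positive_risk(alpha=0.05, power=0.80, prior_true=0.10)
print(f"power 6.9%: {underpowered:.1%}   power 80%: {well_powered:.1%}")
```

Under these assumed inputs the underpowered case comes out close to the figure quoted above, though that does not confirm these were the inputs actually used. The design point is that the significance threshold never changes between the two calls. Only power moves, and the risk moves with it.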

Why AI campaigns make this worse

AI-driven marketing optimization runs more experiments, faster, on smaller increments of traffic, with looser stopping rules than the controlled-experiment literature was built for. Each of those moves degrades power.

More variants split traffic, which shrinks per-arm sample size, which collapses power. Faster cycles tempt teams to call winners on day three of a fourteen-day test, which inflates false-positive rates further. Looser stopping rules (“we’ll just peek and call it when the line crosses”) destroy the statistical guarantees that the confidence number was based on in the first place.
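The cost of peeking can be checked with a small simulation of A/A tests, where no real effect exists by construction. This is a sketch under simplifying assumptions: each day contributes one standardized noise increment, the team looks once per day for fourteen days, and the names and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def aa_false_positive_rates(looks: int = 14, sims: int = 4000,
                            alpha: float = 0.05, seed: int = 7):
    """Simulate A/A tests and compare 'stop at first crossing' with a fixed horizon."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(sims):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one day's standardized noise
            z = running_sum / look ** 0.5       # z-statistic on the data so far
            if abs(z) > z_crit:
                crossed = True                  # a peeker would call a winner here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit           # only the final look counts
    return peeking_hits / sims, fixed_hits / sims


peeking_rate, fixed_rate = aa_false_positive_rates()
print(f"peek daily: {peeking_rate:.1%}   fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate stays near the nominal alpha, while the peeking rate runs several times higher, even though nothing in the simulated data is real.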

The AI part of the story compounds the problem in a way that should worry any operator. Auto-optimizing systems feed yesterday’s “winner” back into tomorrow’s targeting. If the winner was noise, the system is now optimizing toward a phantom. Compound that across a quarter of weekly tests and you do not have a marketing engine. You have a slot machine with branding.

This is where the marketing-governance arc closes. We argued in the governance stack piece that consent is the only real governance column marketers own today, while measurement is observability dressed up as control. Experiment governance is the layer beneath both. If the measurement instrument is broken, no amount of dashboarding fixes the decisions that come out of it.

What experiment governance actually requires

Treat this the way engineering treats deploy governance. Three artifacts, every time, before the first user sees the test.

A power calculation. Before launch, compute the minimum detectable effect at 80% power given your traffic. If your weekly traffic only supports detecting a 12% lift at 80% power, then a 4% lift result is uninterpretable. Not “small but real.” Uninterpretable.

A sample-size and duration commitment. Decide, before launch, how long the test runs and what sample it needs. Write it down. Make stopping early require an explicit override with a reviewer, the way a hotfix to prod requires one.

A pre-registered hypothesis. State what you expect to see and what would falsify it. AI-generated variants are particularly prone to “we’ll try a hundred things and report the three that won.” Pre-registration kills that pattern.
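The three artifacts can be captured in one small pre-launch record. The sketch below is illustrative only: the class and field names are hypothetical, the minimum detectable effect uses the standard normal-approximation formula for two equal arms, and the 5% baseline rate and traffic numbers are assumptions.

```python
from dataclasses import dataclass, replace
from statistics import NormalDist


@dataclass(frozen=True)
class ExperimentPlan:
    """Pre-launch record: hypothesis, committed sample and duration, power inputs."""
    hypothesis: str
    baseline_rate: float
    n_per_arm: int
    planned_days: int
    alpha: float = 0.05
    target_power: float = 0.80

    def min_detectable_relative_lift(self) -> float:
        """Smallest relative lift detectable at target_power (two-sided test)."""
        norm = NormalDist()
        z_sum = norm.inv_cdf(1 - self.alpha / 2) + norm.inv_cdf(self.target_power)
        p = self.baseline_rate
        absolute_mde = z_sum * (2 * p * (1 - p) / self.n_per_arm) ** 0.5
        return absolute_mde / p

    def can_interpret(self, observed_relative_lift: float) -> bool:
        """Apply the rule from the text: lifts below the MDE are uninterpretable."""
        return abs(observed_relative_lift) >= self.min_detectable_relative_lift()


plan = ExperimentPlan(
    hypothesis="New subject line raises checkout conversion; no lift falsifies it",
    baseline_rate=0.05, n_per_arm=20_000, planned_days=14,
)
print(f"MDE at 80% power: {plan.min_detectable_relative_lift():.1%}")
print("4% lift interpretable?", plan.can_interpret(0.04))

# Splitting the same traffic across four times as many arms doubles the MDE.
split = replace(plan, n_per_arm=plan.n_per_arm // 4)
print(f"MDE after splitting: {split.min_detectable_relative_lift():.1%}")
```

Freezing the dataclass is a deliberate choice: once the plan is written down, changing the sample size or duration means creating a new record, which is the code-level analogue of requiring an explicit override.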

None of this is novel. Trustworthy Online Controlled Experiments has been the operating manual since 2020. What is new is that AI optimization removed the slow, manual friction that used to enforce these disciplines by accident. Without that friction, the disciplines have to be governance, not habit.

The cultural failure underneath

The reason this layer is missing is not technical. It is incentive. A marketing team that reports “we ran 40 tests this quarter and 12 produced significant lifts” gets praised. A team that reports “we ran 40 tests this quarter, 27 were underpowered and uninterpretable, and 3 produced significant lifts we can defend” gets questioned. The first team is rewarded for noise. The second team is rewarded for honesty, but only by managers who understand the difference.

This is the same pattern engineering went through with flaky tests. For years teams shipped on green CI runs that were green because the flakes happened to land right that day. The fix was not better tests. The fix was treating flakes as bugs to be paid down, with metrics that surfaced them. Marketing experimentation is at the equivalent moment. Underpowered tests are flakes. They are not “small wins.” They are noise reports.

The CMO who learns to ask “what was the power of that test” before “what was the lift” is the one who stops setting strategy on coin flips. That question costs nothing. The cost is admitting that most of the quarter’s “wins” were not measurements.

Do this now

Pull the last ten experiment reports your team has run. For each, ask three questions. What was the statistical power to detect a plausible lift, one fixed before launch rather than the lift the test happened to report? What was the pre-committed sample size and duration, and did you honor it? What was the pre-registered hypothesis? If you cannot answer all three for most of the ten, you are not running experiments. You are running anecdotes with charts. Fix the next ten before you fix anything else, because every AI optimization decision downstream is only as trustworthy as the test that fed it.


This analysis draws on Ronny Kohavi’s post on A/B test power (LinkedIn, May 2026) and Kohavi, Tang & Xu, “Trustworthy Online Controlled Experiments” (Cambridge, 2020).


All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
