Your AI Feature Is Done When You've Measured the Variance and Rehearsed the Recovery

A deterministic feature is done when the tests pass. You write the spec, you write the assertions, green check, ship. The definition of done has been stable for thirty years because the behavior is stable: same input, same output, every time.

AI features break that contract at the root. The same prompt against the same model can return a defensible answer on Tuesday and a confident fabrication on Wednesday. There is no assertion that turns red on the second case, because the output was never wrong in the sense your test suite understands. It was plausible. It was just false. Jeff Gothelf, who has spent fifteen years writing about how product teams define work, put a sharp edge on the problem in June 2026: “Done is a calibration about an acceptable variance in output and behavior, not a binary result about specification adherence.”

Read that twice. Done is a calibration. The word demands a number, an owner, and a rehearsal. Most teams shipping AI features have none of the three.

A Number From a Real Evaluation

The abstraction stays comfortable until someone measures it. In May 2026, MeasuringU ran an experiment that should be taped to the wall of every team putting AI into a workflow. They took a single six-minute usability video, the kind a UX researcher reviews to find friction in an interface, and they asked humans and frontier models to find the problems independently.

The models were not toys. ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, four runs each, the configurations a serious team would actually deploy. Humans found 9 problems. The AI surfaced 14. Only 3 problems overlapped across all parties. So far this reads like a win for the machines: more findings, broader coverage, faster.

Then they checked the 11 findings the AI raised that no human did. One was a genuine insight the humans had missed. One. The other ten split into seven false alarms and three hallucinations: problems described with full confidence that did not exist in the video at all. Put as rates over those 11 AI-only findings, that is roughly 9% genuine, 64% false alarm, 27% fabricated. To harvest the single real insight the AI added, a human had to wade through ten distractions, three of which were inventions.
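The percentages follow directly from the counts quoted above. A minimal sketch of the arithmetic, where the variable names are chosen here for illustration and are not MeasuringU's:

```python
# Counts as quoted in this article from the MeasuringU write-up.
ai_total = 14                  # problems the AI surfaced
overlap = 3                    # problems that overlapped across all parties
ai_only = ai_total - overlap   # 11 findings no human raised

# Breakdown of the AI-only findings.
breakdown = {"genuine": 1, "false_alarm": 7, "hallucination": 3}
assert sum(breakdown.values()) == ai_only

# Each category as a rounded percentage of the AI-only findings.
rates = {label: round(100 * count / ai_only) for label, count in breakdown.items()}
print(rates)

# Distractions a reviewer clears for each genuine AI-only insight.
noise = breakdown["false_alarm"] + breakdown["hallucination"]
noise_per_insight = noise / breakdown["genuine"]
print(noise_per_insight)
```

The denominator matters: these rates describe the findings only the AI raised, not everything it produced.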

That is the variance Gothelf is talking about, expressed as a measurement instead of a worry. The AI was useful. It was also wrong most of the time it spoke up. Both statements are true at once, and a definition of done that only captures the first one is not a definition. It is a marketing slide.

Why “Tests Pass” Cannot Carry This Weight

The instinct is to reach for the old tool: write more tests, tighten the assertions, raise the bar until the noise stops. It does not work, and the reason is structural rather than a matter of effort.

A test encodes an expected output. A probabilistic feature has an expected distribution. You can assert that a function returns 4 for an input of 2 plus 2. You cannot assert that a usability reviewer returns exactly these nine findings, because the acceptable answer is a spread, and the line between a creative-but-valid finding and a confident hallucination is precisely the judgment you were hoping to automate away. The 64% false-alarm rate is not a bug you can patch. It is a property of the tool at its current capability, and it will shift, up or down, with every model update you do not control.

So acceptance has to move from a single point to a band. Done is no longer “it returned the right answer.” Done becomes “across N runs, the output stayed inside a distribution we agreed to live with, and we know what happens when it drifts outside that band.” The MeasuringU result gives you the shape of the question every team now has to answer before shipping: what false-alarm rate can the people downstream actually absorb, and at what point does the cost of filtering the noise erase the value of the signal?
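One way to make the band concrete is to pool reviewer labels across runs and compare the resulting rates with ceilings the team has agreed on. The sketch below is illustrative only: `AcceptanceBand`, `label_rates`, `within_band`, the label names, and the threshold values are all assumptions made for this example, not anything prescribed by Gothelf or MeasuringU.

```python
from dataclasses import dataclass

LABELS = ("genuine", "false_alarm", "hallucination")

@dataclass(frozen=True)
class AcceptanceBand:
    """Tolerances a team signs off on. The values used below are placeholders."""
    max_false_alarm_rate: float
    max_hallucination_rate: float

def label_rates(runs: list[list[str]]) -> dict[str, float]:
    """Pool human-assigned labels from every run into one rate per label."""
    labels = [label for run in runs for label in run]
    if not labels:
        raise ValueError("no labeled findings to evaluate")
    return {name: labels.count(name) / len(labels) for name in LABELS}

def within_band(runs: list[list[str]], band: AcceptanceBand) -> bool:
    """True when the pooled rates stay under both agreed ceilings."""
    rates = label_rates(runs)
    return (rates["false_alarm"] <= band.max_false_alarm_rate
            and rates["hallucination"] <= band.max_hallucination_rate)

# Two hypothetical runs, each a list of reviewer verdicts on the findings.
runs = [
    ["genuine", "false_alarm"],
    ["false_alarm", "hallucination", "genuine", "genuine"],
]
loose = AcceptanceBand(max_false_alarm_rate=0.4, max_hallucination_rate=0.2)
strict = AcceptanceBand(max_false_alarm_rate=0.3, max_hallucination_rate=0.2)
print(within_band(runs, loose), within_band(runs, strict))
```

The same runs pass one band and fail the other, which is the point: the verdict depends on a tolerance someone chose and wrote down, not on the model alone.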

The Three Things “Done” Now Requires

Gothelf names the second half of the definition cleanly: “You are done when the people downstream of the feature know what to do when it misbehaves.” Not if it misbehaves. When. That single word reorganizes the work. Here is the operational form.

An accepted output distribution. Before you ship, run the feature enough times on representative inputs to characterize its spread, the way MeasuringU characterized four runs each across two models. Write the tolerance down as a number your team signs off on. Not “it works well.” A rate: this is the false-positive level we accept, this is the floor of genuine signal we require, and outside those limits we do not ship.

A named owner for failure triage. When the feature emits a confident fabrication into a downstream workflow, exactly one person owns the response. Not the team. A name. The MeasuringU finding makes the cost concrete: every false positive the AI raises is human time spent disproving a problem that was never there. Someone has to own that filtering loop, decide what reaches the customer, and carry the budget for the verification it demands.

A rehearsed rollback tied to a tripwire. A monitoring signal watches the live distribution. When the false-positive rate crosses the band you accepted, an alarm fires and a rollback you have already practiced executes. Rehearsed is the load-bearing word. A rollback you have never run is a hope, not a control. The first time you exercise it cannot be during the incident.

Three artifacts. None of them is code. All of them are the feature.
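The artifacts are agreements rather than code, but the tripwire's logic is simple enough to sketch, which can help when deciding what the alarm should actually watch. The version below is a minimal illustration under stated assumptions: the `Tripwire` class, the rolling window of reviewer verdicts, and the rollback callback are all hypothetical, and a real rollback would call whatever deployment tooling the team already uses.

```python
from collections import deque
from typing import Callable

class Tripwire:
    """Fire a rollback once when the rolling false-positive rate exceeds the ceiling."""

    def __init__(self, ceiling: float, window: int, on_trip: Callable[[], None]):
        self.ceiling = ceiling
        # True means a reviewer marked that output as a false positive.
        self.recent: deque[bool] = deque(maxlen=window)
        self.on_trip = on_trip
        self.tripped = False

    def record(self, is_false_positive: bool) -> None:
        self.recent.append(is_false_positive)
        # Wait for a full window so a single early miss does not trip the alarm.
        if len(self.recent) < (self.recent.maxlen or 0) or self.tripped:
            return
        rate = sum(self.recent) / len(self.recent)
        if rate > self.ceiling:
            self.tripped = True
            self.on_trip()

# Rehearsal with a stand-in rollback that only records that it ran.
calls: list[str] = []
wire = Tripwire(ceiling=0.5, window=4, on_trip=lambda: calls.append("rollback"))
for verdict in [True, False, True, True, True]:
    wire.record(verdict)
print(calls)
```

Passing the rollback in as a callback keeps it separately runnable, so the same function can be exercised in a drill without waiting for the live rate to cross the band.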

The Human Loop Is the Product

The uncomfortable conclusion of the MeasuringU numbers is that the human oversight loop is not a temporary scaffold you remove once the model improves. At a 64% false-alarm rate, the reviewer who filters the output is doing the load-bearing work. The model widens the search; the human decides what is real. Take the human out and you ship the 64% straight into the workflow, where it costs more to clean up than it ever saved to generate.

This is why “done” for an AI feature has to include the loop. We have written before about how to build the catch-system that intercepts hallucinations and about the verification tax that oversight imposes on every output. The definition of done is where those two ideas become a contract: you do not get to call the feature finished until the catch-system exists, the tax has an owner who is funded to pay it, and the layered review, which can itself add risk, has been tuned so that filtering does not cost more than the signal is worth.

Do This Now

Pick one AI feature you have already shipped. Ask three questions, out loud, with the team in the room.

First: what is the accepted output distribution, as a number? If the answer is a feeling rather than a rate, you shipped a calibration you never calibrated. Run it twenty times on real inputs this week and write down the spread.

Second: who triages a confident fabrication, by name? If the answer is “the team” or “we would notice,” nobody owns it and the failure will route to whoever is unlucky.

Third: when did you last rehearse the rollback? If the answer is never, you have a hope with a deploy button, not a control.
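For the first question, it helps to see how much uncertainty twenty observations leave. The sketch below uses the Wilson score interval, a standard interval for a proportion. The function name and the example counts are assumptions made for illustration, and it treats each of the twenty runs as one labeled output.

```python
import math

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical result: 13 of 20 outputs judged false alarms by a reviewer.
low, high = wilson_interval(hits=13, n=20)
print(f"observed 65%, plausible range {low:.0%} to {high:.0%}")
```

With these example counts the range runs from roughly 43% to 82%. Twenty runs are enough to replace a feeling with a rate, but the band you sign off on should account for an interval that wide.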

A deterministic feature was done when the tests went green. An AI feature is done when the variance is measured, the owner is named, and the recovery has been run at least once before it was needed. Everything short of that is a checkbox pretending to be a calibration.

This analysis synthesizes What “Done” Means When You’re Shipping AI Features (Jeff Gothelf, June 2026) and Does AI Find Real UI Problems or Just Hallucinations? (MeasuringU, May 2026).

Victorino Group helps teams define operational acceptance for AI features so they ship with the oversight loop already in place.