When the AI Tests Pass and the Humans Don't: Three Verification Failures, One Pattern

Daniel sat down to test a marketing site that scored 100% on automated accessibility checks. Within ten minutes, the site had failed him in ways no scanner could see.

The site was built with Lovable, an AI tool that markets itself as producing accessible output by default. Axess Lab’s Hampus Sethfors ran Axe, the industry-standard automated checker. The dashboard reported a perfect score. Then he handed the page to Daniel, a real screen reader user. Daniel tried to open the menu and heard the announcement: “toggle menu.” Nothing else. No state. No “expanded” or “collapsed.” He told Sethfors, “It still says toggle menu, I’m not sure if it works because it doesn’t announce if I have expanded something.”

That is verification failure number one. Three named failures landed in the same week, in three different shapes, with the same root cause underneath. Each one is worth understanding on its own. The pattern they form is worth understanding more.

Lovable: 100% Score, Multiple Critical Failures

The Axess Lab test, published May 13, 2026, is the clearest demonstration of a problem the industry has been talking around for two years. Automated accessibility tooling can only test what it can parse. It checks markup, contrast, focus order, ARIA presence. It cannot check whether a screen reader user can actually accomplish a task on the page.

Lovable’s site passed every automated check. Daniel, using the assistive technology the score was meant to predict, found multiple critical blockers in his first ten minutes. The menu toggle did not announce its state. Form fields lacked the context needed to fill them in. The carousel was unusable without sighted navigation.
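The menu toggle shows how a control can satisfy a static check and still leave a screen reader user guessing. The Python sketch below uses hypothetical markup and two simplified checks invented for illustration; neither is Lovable’s actual output nor Axe’s rule set. The first check asks only whether the button has a name. The second asks whether it exposes an expanded or collapsed state.

```python
from html.parser import HTMLParser

# Hypothetical markup, written for illustration; not Lovable's actual output.
NAME_ONLY = '<button aria-label="Toggle menu"></button>'
NAME_AND_STATE = '<button aria-label="Toggle menu" aria-expanded="false"></button>'


class ButtonCollector(HTMLParser):
    """Collects the attributes of every <button> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self.buttons.append(dict(attrs))


def first_button(markup):
    """Return the attribute dict of the first <button> in the markup."""
    collector = ButtonCollector()
    collector.feed(markup)
    return collector.buttons[0]


def has_accessible_name(attrs):
    # Simplified: looks only at aria-label, not text content or aria-labelledby.
    return bool((attrs.get("aria-label") or "").strip())


def exposes_toggle_state(attrs):
    # A toggle that never sets aria-expanded gives a screen reader no state to announce.
    return attrs.get("aria-expanded") in ("true", "false")


for label, markup in (("name only", NAME_ONLY), ("name and state", NAME_AND_STATE)):
    attrs = first_button(markup)
    print(label, has_accessible_name(attrs), exposes_toggle_state(attrs))
```

Even the second check only confirms that the attribute is present in the markup. Whether it flips when the menu opens, and whether a screen reader actually announces the change, still takes someone like Daniel to confirm.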

The diagnosis is not that Lovable’s AI is bad at accessibility. The diagnosis is that “100% accessibility” was never a property the AI could deliver. It was a property of a human judgment that someone replaced with a metric. The score is the contract the team thought they were signing. The user experience is the contract they actually signed.

Bun: 6,755 Commits, Zero Human Reviewers

Six days. 6,755 commits. Zero human reviewers.

That is the count Jiacai Liu pulled from the Bun repository’s commit log between May 8 and May 14, 2026, analyzing the Rust rewrite that the project is conducting at industrial scale. The code is written by Claude. The reviews are written by Claude. The merge decisions are made by Claude. No human is in the loop on any individual commit.

Liu, who has no relationship with Bun and analyzed the data as an outside observer, framed the concern in one sentence: “Code you don’t understand should not run in production.”

The Bun team would presumably argue that the test suite is the verifier, that the metrics will catch regressions, that the scale of generation justifies the absence of human review. That argument has the same shape as Lovable’s accessibility score. Both delegate human judgment to an automated signal. Both assume the signal captures what matters.

The Lovable case demonstrates how that assumption can break. Daniel could not open the menu, and the score said the site was perfect. If a screen reader exposes a category of failure that Axe cannot detect, what category of failure does a test suite fail to detect in 6,755 commits’ worth of new Rust code?

We do not know yet. We may find out within months, if a failure reaches production and someone has to debug a system that no living engineer has fully read.

Aviator: Verification Works, When You Did the Spec

The third case complicates the story in a useful way. Ankit Jain at Aviator published an experiment on May 17, 2026, running spec-based review across 6,000 lines of generated code. The team extracted 65 checkable acceptance criteria from the spec. A reviewer agent validated all 65 in roughly six minutes. The result: 60 pass, 4 fail, 1 partial.

This is verification that scales. Six minutes of automated review, anchored to specification, replaced what would have been hours of human PR review. The four failures were caught. The one partial was flagged. The work could move forward with the confidence the verification provided.
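The shape of that result is simple to represent. The Python sketch below is an assumption about how such a review could be tallied, not Aviator’s tooling: the criterion IDs are placeholders, the `CriterionResult` type and `summarize` function are hypothetical, and only the counts (60 pass, 4 fail, 1 partial) come from the experiment as reported.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    criterion_id: str  # placeholder ID, not a real Aviator criterion
    status: str        # "pass", "fail", or "partial"


def summarize(results):
    """Count results by status and list every criterion a human should look at."""
    counts = Counter(result.status for result in results)
    needs_attention = [r.criterion_id for r in results if r.status != "pass"]
    return counts, needs_attention


# Placeholder data shaped to match the reported totals: 60 pass, 4 fail, 1 partial.
results = (
    [CriterionResult(f"AC-{i:02d}", "pass") for i in range(1, 61)]
    + [CriterionResult(f"AC-{i:02d}", "fail") for i in range(61, 65)]
    + [CriterionResult("AC-65", "partial")]
)

counts, needs_attention = summarize(results)
print(dict(counts))
print("needs attention:", needs_attention)
```

The tally can only ever cover the criteria in the list. Anything the spec did not name has no row, so it can neither pass nor fail.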

But Jain wrote the sentence that should be on every engineering leader’s wall: “You cannot write tests against requirements you didn’t know to articulate.”

Spec-based verification works only if someone did the cognitive work of writing the spec. That work cannot be delegated to the same model that will generate the code. It is the human judgment that converts intent into checkable claims. It is also the work that most teams skip, because it feels slow and the AI feels fast.

Frederick Vanbrabant modeled this trade-off in a hypothetical Gantt chart published May 15, 2026. A traditional project might look like 70 days of development plus 10 days of scoping. With AI, the development phase collapses to roughly 3 days. The total project does not shrink in proportion, because the scoping and documentation phase expands to about 40 days: on these figures the schedule goes from about 80 days to about 43, not from 80 to 13. The bottleneck moved. It did not disappear.
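The figures in this model are easy to tally. The Python sketch below uses only the day counts quoted above and computes, for each plan, the total length and the share of it spent on scoping. The function name is an assumption made for illustration.

```python
def scoping_share(scoping_days, development_days):
    """Return (total days, fraction of the total spent on scoping)."""
    total = scoping_days + development_days
    return total, scoping_days / total


# Day counts as quoted from the hypothetical Gantt chart.
traditional_total, traditional_share = scoping_share(scoping_days=10, development_days=70)
ai_total, ai_share = scoping_share(scoping_days=40, development_days=3)

print(f"traditional: {traditional_total} days, {traditional_share:.1%} scoping")
print(f"with AI:     {ai_total} days, {ai_share:.1%} scoping")
```

On these numbers, scoping goes from one eighth of the schedule to roughly 93 percent of it.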

The Pattern Underneath

Three cases. Three different verification shapes. One root cause.

Lovable replaced the screen reader user with Axe. Bun replaced the human reviewer with Claude. The third substitution is the one Aviator avoided and Vanbrabant’s hypothetical organization attempts: replacing the spec writer with whoever happens to be holding the prompt. In each substitution, a category of human judgment is delegated to a system that cannot hold it.

The verification debt thesis (covered previously in The AI Verification Debt and The AI Verification Tax) treated the problem as a measurement gap: developers do not trust AI output and do not verify it systematically, so unreviewed code accumulates. These three cases extend the diagnosis. The problem is not only that verification is skipped. The problem is that verification is performed against a proxy that the team mistook for the real thing.

A 100% accessibility score is a proxy for “blind users can use this site.” A passing test suite is a proxy for “the new code does what the old code did.” A reviewer agent’s pass-rate is a proxy for “this code matches what we actually meant to build.” Each proxy has a domain of validity. None of them captures the full property the team needs.

Aviator’s experiment is instructive precisely because it surfaces the limit. The 60 passing criteria do not mean the code is correct. They mean the code satisfies 60 of the 65 things the team knew to ask about. Whatever the team did not articulate, the verification cannot catch. The reviewer agent is honest about its scope. The team’s spec work is the substrate that gives the score meaning.

Where the Real Work Moved

If you accept the Vanbrabant model (development time collapses, scoping time expands), the implications for engineering leadership are direct.

The bottleneck for AI-assisted development is no longer typing speed. It is articulation speed. How fast can your team translate “we need a checkout flow that works for screen reader users” into a list of checkable acceptance criteria that a reviewer agent can validate? How fast can you turn “the new Rust port should preserve the behavior of the existing JavaScript implementation” into a property-based test suite that catches the cases your generation model will not catch on its own?
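The second question can be made concrete with a differential test: feed the same randomly generated inputs to the old and new implementations and report the first disagreement. The Python sketch below is a hand-rolled stand-in for a property-based testing library, and both functions are toy placeholders invented for illustration, not code from Bun. The ported version carries a deliberate bug so the harness has something to find.

```python
import random


def reference_trim(path):
    """Stand-in for the existing implementation: strips all edge slashes."""
    return path.strip("/")


def ported_trim(path):
    """Stand-in for a port with a subtle bug: strips only one slash per edge."""
    if path.startswith("/"):
        path = path[1:]
    if path.endswith("/"):
        path = path[:-1]
    return path


def find_counterexample(reference, candidate, trials=500, seed=0):
    """Return the first random input where the two functions disagree, else None."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        length = rng.randint(0, 6)
        text = "".join(rng.choice("/ab") for _ in range(length))
        if reference(text) != candidate(text):
            return text
    return None


counterexample = find_counterexample(reference_trim, ported_trim)
print("first disagreement:", repr(counterexample))
```

A hand-written example such as "/a/" passes for both versions. The bug only shows on inputs with repeated slashes, which is the kind of case a human has to think to generate. Choosing the input alphabet is itself an act of articulation: the harness finds nothing outside the space someone told it to explore.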

That work is human. It is not optional. It is the surface that determines whether the verification you delegate to AI is actually verifying what matters or just generating green dashboards.

The Lovable case names the failure mode in its sharpest form. A real user, with a real assistive technology, found real failures, in real time, that the automated check did not surface. The site had a 100% score. Daniel could not use it.

If your verification surface looks more like the Axe score than like ten minutes with Daniel, you are accumulating the kind of debt that arrives as a customer support ticket, an accessibility lawsuit, or a production outage that no one on the current team can debug.

Do This Now

Audit your last 30 days of AI-assisted output against one question: for each verification gate that signed off on shipping, what was the underlying property the gate was a proxy for, and how confident are you that the proxy captures it?

If your AI-generated code passes a test suite, name three failure modes the suite does not cover. If your AI-built UI passes an accessibility scanner, run it past one real screen reader user this month. If your AI-generated commits merge automatically, write down which class of regression you are willing to ship to production without human review, and which class you are not, and make that line explicit in the merge automation.
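Making that line explicit can be as small as a reviewed list of paths. The Python sketch below is a minimal illustration under stated assumptions: the path patterns and function names are hypothetical, and a real setup would live in whatever merge automation the team already runs.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: areas where the team is NOT willing to ship a
# regression without a human having read the change.
HUMAN_REVIEW_PATTERNS = ("src/auth/*", "src/payments/*", "migrations/*")


def requires_human_review(changed_paths):
    """True if any changed path falls inside a protected area."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in HUMAN_REVIEW_PATTERNS
    )


def merge_decision(changed_paths):
    """Map a change set to the action the merge automation should take."""
    if requires_human_review(changed_paths):
        return "hold for human review"
    return "auto-merge"


print(merge_decision(["docs/readme.md"]))
print(merge_decision(["docs/readme.md", "src/auth/token.rs"]))
```

The value is less in the code than in the list. Writing the patterns down forces the team to decide, in advance and in the open, which regressions it accepts without a human reader.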

Three cases in one week is not a coincidence. It is the industry learning, in public, that the verification surface inherited from the pre-AI era was built for a code-generation rate that no longer applies. The new rate demands a new surface. The teams building that surface now will compound the advantage. The teams trusting the green dashboards will compound the debt.

This analysis synthesizes Lovable’s 100% Accessible Site (Axess Lab, May 2026), My Thoughts on Bun’s Rust Rewrite (Jiacai Liu, May 2026), How to Avoid AI Code Slop (Engineering Leadership Newsletter, May 2026), and I Don’t Think AI Will Make Your Processes Go Faster (Frederick Vanbrabant, May 2026).

Victorino Group helps enterprises design verification gates that protect real users, not green dashboards.