AI Code Review Is Setting Its Own Standards. Who Reviews the Reviewer?

Thiago Victorino

When we first wrote about Cursor’s BugBot in January, the resolution rate stood at 70%. Three months and 50,310 PRs later, it is 78.13%: a gain of just over eight points since January, and a 26 percentage point increase since the tool’s July 2025 launch.

The number is impressive. What is more interesting is how it got there.

BugBot now runs across 110,000+ repositories with learning enabled. It has accumulated 44,000+ rules, each one generated from developer feedback. Rules auto-promote when evidence supports them. They auto-disable when developers reject their suggestions. The system writes its own standards and enforces them without waiting for a human to approve the policy change.

This is the part worth paying attention to.

Two Philosophies, Same Problem

Two articles published within 24 hours of each other in April 2026 reveal a fundamental tension in how engineering teams think about AI-assisted review.

Cursor’s BugBot (Michael Zhao, Cursor blog) reports benchmark results across six tools. BugBot leads at 78.13%. Greptile follows at 63.49%. CodeRabbit scores 48.96%. GitHub Copilot lands at 46.69%. Codex posts 45.07%. Gemini Code Assist trails at 30.93%.

The methodology: public repositories only, with an LLM judge assessing whether AI-generated comments were addressed before merge.
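
To make that concrete, here is a minimal sketch of what an LLM-judge resolution check implies. Everything in it is our illustration, not Cursor’s published code: `call_llm` stands in for whatever model API you use, and the prompt wording is invented.

```python
# Illustrative only: `call_llm` is a placeholder for any chat-model API.
def judge_resolution(comment: str, merged_diff: str, call_llm) -> bool:
    """Ask an LLM whether a review comment was addressed before merge."""
    prompt = (
        "You are judging a code review comment.\n"
        f"Comment: {comment}\n"
        f"Diff of changes made before merge:\n{merged_diff}\n"
        "Was the comment addressed? Reply YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")


def resolution_rate(pairs, call_llm) -> float:
    """pairs: (comment, merged_diff) tuples drawn from public PRs."""
    addressed = sum(judge_resolution(c, d, call_llm) for c, d in pairs)
    return addressed / len(pairs)
```

Note who is doing the judging: the same class of model that wrote the comments. That circularity comes back later.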

Claude PR Review (Ikeh Akinyemi, LogRocket) takes the opposite approach. Instead of an autonomous system, Akinyemi built a five-agent pipeline: qualification, bug detection, git blame context, past PR patterns, and code-comment alignment. Each review passes through an 80-point confidence threshold. Scoring subagents are designed to disprove findings, not confirm them.

The system caught an auth bypass at confidence 100. A silent catch block scored 75 and was filtered out. The cost: $15-25 per review.
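
The threshold mechanics reduce to a simple gate. A minimal sketch, with data structures and names that are ours rather than Akinyemi’s (his pipeline’s scoring subagents are considerably more elaborate):

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 80  # Akinyemi's cutoff: findings below it never surface


@dataclass
class Finding:
    description: str
    confidence: int  # 0-100, assigned by subagents trying to disprove the finding


def surface(findings: list[Finding]) -> list[Finding]:
    """Only findings that survive adversarial scoring reach a human."""
    return [f for f in findings if f.confidence >= CONFIDENCE_THRESHOLD]


# The two outcomes from the article, in miniature:
findings = [
    Finding("auth bypass in session handling", confidence=100),    # surfaced
    Finding("silent catch block swallows errors", confidence=75),  # filtered out
]
assert [f.description for f in surface(findings)] == [
    "auth bypass in session handling"
]
```

The gate that suppresses noise is the same gate that filtered out the silent catch block at 75. Moving the threshold trades one failure mode for the other.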

One system learns its own rules. The other follows rules a human wrote.

The Autonomy Spectrum

BugBot’s dashboard gives teams manual override, but the default is autonomous. Rules promote and demote based on aggregate developer behavior. If enough developers across enough repos accept a suggestion, it becomes a standard. If enough reject it, it disappears.

This is governance by consensus, measured in merge decisions.
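
Cursor has not published the promotion logic, but the behavior described reduces to something like this toy model. The thresholds, counters, and state names are all our assumptions:

```python
# Toy model only: Cursor has not published BugBot's actual promotion logic.
MIN_FEEDBACK = 20    # accept/reject events required before any transition
PROMOTE_RATE = 0.7   # acceptance rate at which a rule becomes a standard
DISABLE_RATE = 0.3   # acceptance rate at which a rule is switched off


def rule_state(accepts: int, rejects: int) -> str:
    """Derive a rule's state from aggregate developer behavior."""
    total = accepts + rejects
    if total < MIN_FEEDBACK:
        return "candidate"   # not enough evidence yet
    rate = accepts / total
    if rate >= PROMOTE_RATE:
        return "promoted"    # enforced by default, no human sign-off
    if rate <= DISABLE_RATE:
        return "disabled"    # developers voted it down with their merges
    return "candidate"
```

Notice what is absent from that function: any input representing a team’s stated standards. The only signal is behavior.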

Akinyemi’s Claude pipeline works differently. The CLAUDE.md file defines what correctness means for his team. The agents enforce those definitions. Nothing promotes or demotes without a human editing the config. His own conclusion: “The setup matters just as much as the tool. A CLAUDE.md that actually reflects your team’s correctness rules… that’s what separates signal from noise.”
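
For readers who have not seen one, a CLAUDE.md is just a markdown file of plain-language instructions the agents read before reviewing. A hypothetical excerpt, ours rather than Akinyemi’s:

```markdown
# CLAUDE.md (illustrative excerpt, not Akinyemi's actual file)

## Correctness rules
- Any change touching auth or session handling is high-risk: flag it
  even at lower confidence.
- Empty or silent catch blocks are always findings, never style nits.
- Follow the error-handling conventions in src/lib/errors before
  proposing alternatives.

## Review behavior
- Report only findings you could defend to a senior engineer.
- Cite file and line for every claim.
```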

Both approaches work. Neither answers the harder question: who decides what “correct” means when the tool is smarter than most of the reviewers?

The Measurement Problem

We covered the benchmark paradox in February: every vendor wins their own test. BugBot’s new data adds a wrinkle.

The methodology uses an LLM to judge whether AI comments were addressed. AI evaluating AI. The Cursor team acknowledges the limitation: public repos only, which may not reflect enterprise codebases. The LLM judge introduces circularity. A model trained on similar data evaluates another model’s output on that same data.

This does not invalidate the results. A 78% resolution rate, even measured imperfectly, is a real signal. Developers are accepting most of what BugBot suggests. But “developers accepted it” and “it was the right call” are different claims.

We have seen this before. The METR study found that developers believed AI made them 20% faster when it actually made them 19% slower. Acceptance is not accuracy.

The Cost Question

Akinyemi’s pipeline costs $15-25 per review. At scale, this adds up fast. A team merging 50 PRs per day spends $750-1,250 daily on review alone.
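
Extended one step (the $15-25 range is Akinyemi’s; the PR volume and a 22-working-day month are our assumptions):

```python
# Back-of-envelope cost math; volume and calendar figures are illustrative.
low, high = 15, 25   # dollars per review, Akinyemi's reported range
prs_per_day = 50
working_days_per_month = 22

print(f"Daily:   ${low * prs_per_day:,}-{high * prs_per_day:,}")
print(f"Monthly: ${low * prs_per_day * working_days_per_month:,}-"
      f"{high * prs_per_day * working_days_per_month:,}")
# Daily:   $750-1,250
# Monthly: $16,500-27,500
```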

BugBot’s cost is bundled into Cursor’s subscription. This makes it cheaper per review but removes price as a quality signal. When review is free, there is no economic pressure to reduce false positives. The only pressure is developer annoyance, which is harder to measure than a budget line.

The economics push toward BugBot’s model: autonomous, bundled, high-volume. The governance question pushes toward Akinyemi’s: deliberate, configurable, expensive.

As we argued in You Are Not Killing Code Review, the question was never about the tool. It is about who owns the standard the tool enforces.

What 44,000 Rules Actually Means

Each of those 44,000 learned rules represents a decision about code quality that was made without a formal review process. No architecture review board approved them. No engineering manager signed off. They emerged from the aggregate behavior of developers interacting with suggestions.

Some of those rules are probably excellent. Some encode the preferences of the loudest or most active contributors. Some may reflect the biases of the repositories where BugBot sees the most traffic.

This is not a flaw. It is a design choice. And it is the design choice that matters most, because it determines who controls the standard. Cursor chose emergent consensus. Akinyemi chose explicit configuration. Most teams have not made this choice consciously. They adopted a tool and inherited its governance model by default.

The Real Question for Engineering Leaders

The tools are getting better. That is not in dispute. The question is organizational.

When BugBot auto-promotes a rule that conflicts with your team’s architecture decisions, who notices? When a confidence threshold filters out a real bug because it scored 75 instead of 80, who adjusts the threshold? When 44,000 rules accumulate over months, who audits them for consistency with your actual engineering standards?

Three questions every team should answer before adopting AI code review:

1. Who writes the rules? If the answer is “the tool learns them,” you have delegated a governance decision to an optimization function. That may be fine. But make it a conscious choice.

2. Who reviews the rules? BugBot’s dashboard exists. How many teams actually use it? The default is autonomous. Defaults become permanent.

3. What happens when the tool is wrong at scale? A bad rule that auto-promotes across 110,000 repositories affects more codebases than any single engineering decision ever made by a human.

These are not technical questions. They are organizational ones. The tools will keep improving. The 78% will become 85%, then 90%. The question of who governs the governor only gets harder as the tools get better.


This analysis builds on Cursor’s April 2026 BugBot benchmark report and Ikeh Akinyemi’s Claude PR review pipeline writeup (April 2026), cross-referenced with our previous coverage of BugBot’s evolution, the benchmark paradox, and code review as governance.

Victorino Group helps engineering teams build governance frameworks for AI-assisted development before the tools make the decisions for them. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
