Governance as Advantage

AI Code Review: Lessons from Cursor BugBot

Thiago Victorino
12 min read

Code agents have increased PR output, but they have created a new problem: human review time does not scale.

Teams adopting AI for coding report increased cognitive load, review bottlenecks, and higher risk of defects escaping to production. Adoption of code review agents grew from 14.8% in January to 51.4% in October 2025.

The most alarming data point: code churn — lines reverted in less than two weeks — doubled with AI-generated code. Up to 40% of alerts from review tools are ignored.

The BugBot Case

Cursor’s BugBot processes over 2 million PRs per month. In 6 months and 40 experiments, the team increased resolution rate from 52% to 70% and bugs found per run from 0.4 to 0.7.

Resolution rate is the primary metric: the percentage of reported bugs that the author actually fixed in the final merged code. BugBot uses AI to classify automatically, at merge time, whether each reported bug was resolved.
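
A minimal sketch of that classification step, assuming a hypothetical `call_llm` client (this is not BugBot's actual implementation):

```python
# Sketch only: classify at merge time whether a reported bug was fixed.
# `call_llm` is a hypothetical stand-in for whatever model client you use.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with your provider's SDK."""
    raise NotImplementedError

def was_resolved(bug_report: str, final_diff: str) -> bool:
    """Return True if the reported bug appears fixed in the merged code."""
    prompt = (
        "A code review bot reported this bug:\n"
        f"{bug_report}\n\n"
        "Here is the final diff that was merged:\n"
        f"{final_diff}\n\n"
        "Was the reported bug addressed? Answer strictly YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

def resolution_rate(outcomes: list[bool]) -> float:
    """Share of reported bugs classified as fixed at merge time."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```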

Evolution: Static Pipeline to Autonomous Agent

Static Pipeline (V1)

A fixed sequence of steps with pre-defined context (a sketch follows the list):

  • 8 parallel passes with randomized diff order
  • Combining similar bugs into buckets
  • Majority voting to filter false positives
  • Merging each bucket into a single description
  • Filtering unwanted categories
  • Final validator to catch false positives
  • Deduplication against previous runs
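
A rough sketch of this V1-style pipeline, with a hypothetical `review_pass` LLM call and a deliberately naive similarity key standing in for real bucketing:

```python
# Sketch of a V1-style static pipeline: parallel passes with shuffled diff
# order, bucketing of similar findings, vote-based filtering, and dedup.
import random
from collections import defaultdict

N_PASSES = 8    # parallel passes with randomized diff order
MIN_VOTES = 3   # voting threshold to filter false positives (illustrative)

def review_pass(diff_hunks: list[str]) -> list[str]:
    """Hypothetical: one LLM pass over the diff, returning bug descriptions."""
    raise NotImplementedError

def bucket_key(bug: str) -> str:
    """Naive similarity key; a real system would use embeddings or clustering."""
    return " ".join(sorted(bug.lower().split()))[:120]

def run_pipeline(diff_hunks: list[str], previously_reported: set[str]) -> list[str]:
    buckets: dict[str, list[str]] = defaultdict(list)
    for _ in range(N_PASSES):
        shuffled = random.sample(diff_hunks, k=len(diff_hunks))  # randomized order
        for bug in review_pass(shuffled):
            buckets[bucket_key(bug)].append(bug)

    reports = []
    for key, bugs in buckets.items():
        if len(bugs) < MIN_VOTES:           # weak signals are dropped
            continue
        if key in previously_reported:      # dedupe against earlier runs
            continue
        reports.append(max(bugs, key=len))  # merge each bucket into one description
    return reports
```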

Agentic Architecture (V11)

The agent decides where to investigate more deeply (a sketch follows the list):

  • Reasoning about the diff with tool-calling capability
  • Autonomous decision on investigation depth
  • Dynamic context: fetches information as needed
  • Aggressive prompts encouraging complete investigation
  • Rich experimentation surface via toolset
  • Model pulls additional context at runtime
  • Tool design adjustments impact results

Dynamic Context: Less Is More

Providing fewer details initially allows the agent to pull relevant context on its own. This approach reduced total agent tokens by 46.9% in A/B tests, while simultaneously improving response quality.

Files as Interface: Long outputs become files the agent can read selectively.

Dynamic context techniques:

  • Tool outputs converted to files
  • Chat history as reference material
  • Skills with minimal descriptions + dynamic lookup
  • MCP tool descriptions synced to folders
  • Terminal sessions integrated with filesystem

Benefits: Token efficiency + response quality. Less contradictory information results in better reasoning.
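
A small sketch of the "files as interface" idea, with illustrative paths and thresholds: long tool output is persisted to disk, and the agent receives a pointer plus a preview it can expand on demand.

```python
# Sketch: spill long tool outputs to files instead of flooding the context.
import os
import tempfile
from pathlib import Path

MAX_INLINE_CHARS = 2_000  # illustrative threshold: anything longer goes to a file

def to_context(tool_name: str, output: str) -> str:
    """Return either the full output or a short pointer to a saved file."""
    if len(output) <= MAX_INLINE_CHARS:
        return output
    fd, raw_path = tempfile.mkstemp(prefix=f"{tool_name}_", suffix=".txt")
    os.close(fd)
    Path(raw_path).write_text(output, encoding="utf-8")
    # The agent sees a pointer plus a preview and can read more if it needs to.
    return f"[{tool_name}: {len(output)} chars saved to {raw_path}]\n{output[:500]}"

def read_slice(path: str, start_line: int, end_line: int) -> str:
    """Tool the agent can call to read only the lines it cares about."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start_line - 1:end_line])
```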

Deterministic Tools

Tool-augmented agents delegate specific tasks to static analysis, reducing tokens and hallucinations.

The hybrid architecture runs deterministic checks first, then uses the LLM only for semantic reasoning that tools cannot do.

Hybrid Pipeline:

  1. Linters and SAST run first (deterministic)
  2. AST parsing structures code semantically
  3. LLM receives results + diff (fewer tokens)
  4. Agent focuses on logical and contextual bugs

Tools agents use:

  • Linters: ESLint, Ruff, golangci-lint
  • Type Checkers: Mypy, TypeScript, fbinfer
  • AST Parsers: Tree-sitter, ast-grep, OXC
  • SAST: Semgrep, CodeQL, Checkmarx
  • MCP Servers: Expose tools via protocol

Why it works: Deterministic tools provide “ground truth” for critical operations. The LLM does not need to spend tokens detecting syntax or type errors — it focuses on what truly matters.
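
A sketch of the hybrid flow using Ruff as the deterministic step and a hypothetical `call_llm` client; exact CLI flags may vary by tool version.

```python
# Sketch of a hybrid pipeline: deterministic tools run first, and the LLM
# only sees their structured findings plus the diff.
import json
import subprocess

def run_ruff(paths: list[str]) -> list[dict]:
    """Deterministic lint pass; returns structured findings."""
    proc = subprocess.run(
        ["ruff", "check", "--output-format", "json", *paths],
        capture_output=True, text=True,
    )
    return json.loads(proc.stdout or "[]")

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model client

def hybrid_review(diff: str, changed_paths: list[str]) -> str:
    lint_findings = run_ruff(changed_paths)  # step 1: deterministic ground truth
    prompt = (
        "Deterministic tools already found these issues (do not repeat them):\n"
        f"{json.dumps(lint_findings, indent=2)}\n\n"
        "Focus only on logical, semantic, and contextual bugs in this diff:\n"
        f"{diff}"
    )
    return call_llm(prompt)                  # step 4: LLM focuses on what tools cannot do
```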

Infrastructure for Scale

Robust Git Integration

BugBot rebuilt its Git integration in Rust for speed and reliability. The guiding principles: minimize fetched data and use efficient caching.

Rate Limiting and Batching

BugBot monitors rate limits and batches requests to operate within GitHub's API constraints, as sketched below.
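
A minimal sketch of that pattern against GitHub's REST API, using the standard X-RateLimit-* response headers and an illustrative safety margin; it is not BugBot's implementation.

```python
# Sketch: back off when close to the rate limit and fetch work in batches.
import time
import requests

MIN_REMAINING = 50  # illustrative safety margin before backing off

def github_get(session: requests.Session, url: str) -> requests.Response:
    resp = session.get(url)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", MIN_REMAINING + 1))
    reset_at = int(resp.headers.get("X-RateLimit-Reset", 0))
    if remaining <= MIN_REMAINING:
        time.sleep(max(0, reset_at - time.time()))  # wait for the window to reset
    return resp

def fetch_in_batches(session: requests.Session, urls: list[str], batch_size: int = 20):
    """Yield responses in batches instead of firing every request at once."""
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        yield [github_get(session, u) for u in batch]
```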

Customizable Rules

BugBot Rules allow encoding codebase-specific invariants without hardcoding them in the system (a loading sketch follows the list):

  • Unsafe migrations
  • Incorrect internal API usage
  • Project conventions
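
A sketch of injecting such rules into the review prompt. The rules file is assumed here to be a BUGBOT.md markdown file at the repository root; check Cursor's documentation for the exact name and location your setup expects.

```python
# Sketch: load team-specific review rules from the repo and prepend them
# to the review prompt instead of hardcoding them in the system.
from pathlib import Path

def load_rules(repo_root: str) -> str:
    rules_file = Path(repo_root) / "BUGBOT.md"  # assumed convention
    return rules_file.read_text(encoding="utf-8") if rules_file.exists() else ""

def build_review_prompt(diff: str, repo_root: str) -> str:
    rules = load_rules(repo_root)
    preamble = f"Project-specific review rules:\n{rules}\n\n" if rules else ""
    return preamble + f"Review this diff for bugs:\n{diff}"
```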

Metrics for AI-Powered DevEx

Modern DevEx requires metrics beyond speed:

Flow: Can developers achieve uninterrupted deep work?

Clarity: Do they understand code and context quickly?

Quality: Does the system resist drift and degradation?

Energy: Are work patterns sustainable?

Governance: Does AI behave predictably and traceably?

Code Review Metrics

  • Resolution Rate: % of reported bugs that were fixed. BugBot’s primary metric.
  • Inspection Rate: LOC / Review Hours. Benchmark: 150-500 LOC/hour.
  • Change Failure Rate: DORA metric. Canary for quality problems.
  • Time to First Review: Recommended target: < 24h. Directly impacts flow.
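
A sketch of computing these metrics from basic PR data; the field names are illustrative, not any particular tool's schema.

```python
# Sketch: compute inspection rate, change failure rate, and time to first
# review from minimal per-PR records.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PullRequest:
    lines_changed: int
    review_hours: float
    opened_at: datetime
    first_review_at: datetime
    caused_incident: bool  # input for change failure rate

def inspection_rate(prs: list[PullRequest]) -> float:
    """LOC reviewed per review hour (benchmark: roughly 150-500)."""
    hours = sum(p.review_hours for p in prs)
    return sum(p.lines_changed for p in prs) / hours if hours else 0.0

def change_failure_rate(prs: list[PullRequest]) -> float:
    """Share of changes that led to an incident (DORA metric)."""
    return sum(p.caused_incident for p in prs) / len(prs) if prs else 0.0

def time_to_first_review_hours(pr: PullRequest) -> float:
    """Target: under 24 hours."""
    return (pr.first_review_at - pr.opened_at).total_seconds() / 3600
```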

The Perception Paradox

METR Study (2025): Experienced developers using AI on their own open-source repositories showed surprising results.

  • Actual time with AI: +19% (slower)
  • Prior expectation: -24% (thought it would be faster)

Even after experiencing the delay, developers still believed AI accelerated them by 20%.

Implications for Leaders:

  • Do not trust perceptions — measure objectively
  • 60% of leaders cite the lack of clear metrics as their biggest challenge
  • Baseline before adopting: Cycle time, quality, satisfaction
  • Compare 3-6 months later: Real data vs. expectations

Treat DevEx as a systems design problem, not a cultural initiative. Define concrete metrics before scaling AI tools.

The False Positives Problem

AI code review tools typically operate with a 5-15% false positive rate. But the credibility cost is high.

Why tools fail:

  • Reading diffs without project context
  • Syntax-based checks, not intent
  • No awareness of internal conventions
  • Static checks on dynamic behavior
  • Hallucinations from generalist LLMs

Result: Up to 40% of alerts are ignored. Automation generates noise instead of actionable insights.

Mitigation Strategies

Majority Voting: Multiple parallel passes. A real bug shows up repeatedly, producing a stronger signal.

Feedback Loop: Developers mark false positives. The system learns from them (sketched below).

Severity Calibration: Start restrictive; loosen or disable rules that generate too much noise.
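
A minimal sketch of the feedback loop, with a deliberately naive similarity key and an illustrative on-disk store; a real system might use embeddings and a database.

```python
# Sketch: store findings marked as false positives and suppress similar
# findings in future runs.
import json
from pathlib import Path

FP_STORE = Path("false_positives.json")  # illustrative location

def finding_key(rule: str, message: str) -> str:
    """Naive similarity key for grouping equivalent findings."""
    return f"{rule}:{' '.join(message.lower().split())[:80]}"

def mark_false_positive(rule: str, message: str) -> None:
    known = set(json.loads(FP_STORE.read_text())) if FP_STORE.exists() else set()
    known.add(finding_key(rule, message))
    FP_STORE.write_text(json.dumps(sorted(known)))

def suppress_known(findings: list[dict]) -> list[dict]:
    """Drop findings the team has already flagged as noise."""
    known = set(json.loads(FP_STORE.read_text())) if FP_STORE.exists() else set()
    return [f for f in findings if finding_key(f["rule"], f["message"]) not in known]
```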

Prompting Inversion

Traditional Approach

Restrict the model to minimize false positives:

“Be conservative. Only report bugs if you have high certainty. Avoid false alarms.”

Result: Model too cautious, misses real bugs.

Agentic Approach

Encourage aggressive investigation:

“Investigate every suspicious pattern. Err on the side of reporting. Use tools to verify hypotheses.”

Result: Agent explores more, uses tools to validate before reporting.

In agentic architecture, the ability to call tools and fetch additional context fundamentally changes prompting strategy. The model can investigate before concluding.
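
A small illustration of the inversion, reusing the prompts quoted above: the aggressive prompt is only selected when the agent actually has tools to verify its hypotheses.

```python
# Sketch: choose the prompting strategy based on whether tools are available.
CONSERVATIVE_PROMPT = (
    "Be conservative. Only report bugs if you have high certainty. "
    "Avoid false alarms."
)

AGGRESSIVE_PROMPT = (
    "Investigate every suspicious pattern. Err on the side of reporting. "
    "Use tools to verify hypotheses before including them."
)

def system_prompt(has_tools: bool) -> str:
    # Without tools, aggressive prompting just inflates false positives;
    # with tools, the agent can validate suspicions before reporting them.
    return AGGRESSIVE_PROMPT if has_tools else CONSERVATIVE_PROMPT
```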

Implementation Roadmap

Phase 1 - Baseline (4-6 weeks):

  • Measure current cycle time
  • Document code quality
  • Satisfaction survey
  • Map review bottlenecks

Phase 2 - Pilot (4-6 weeks):

  • Select pilot team
  • Configure initial rules
  • Calibrate sensitivity
  • Collect weekly feedback

Phase 3 - Iteration (3-6 months):

  • Analyze resolution rate
  • Adjust rules by feedback
  • Add custom rules
  • Compare with baseline

Phase 4 - Scale (ongoing):

  • Expand to other teams
  • Monitor DORA metrics
  • Integrate into onboarding
  • Document playbooks

Do not rush calibration — lost credibility is hard to recover.

The Future of AI Code Review

Code Execution: Agents running code to verify their own bug reports.

Autofix: Agent that not only finds but automatically fixes bugs.

Continuous Monitoring: Constant codebase scanning, not just on PRs.

BugBot today is multiple times better than at launch. In a few months, it will be significantly better again.


At Victorino Group, we implement governed AI systems for engineering teams that need quality without sacrificing speed. If you want to scale code review with AI while maintaining control, let’s talk.
