Governance as Advantage

AI Code Review: Lessons from Cursor BugBot

Thiago Victorino
12 min read

Code agents have increased PR output, but they have created a new problem: human review time does not scale.

Teams adopting AI for coding report increased cognitive load, review bottlenecks, and higher risk of defects escaping to production. Adoption of code review agents grew from 14.8% in January to 51.4% in October 2025.

The most alarming data point: code churn — lines reverted in less than two weeks — doubled with AI-generated code. Up to 40% of alerts from review tools are ignored.

The BugBot Case

Cursor’s BugBot processes over 2 million PRs per month. In 6 months and 40 experiments, the team increased resolution rate from 52% to 70% and bugs found per run from 0.4 to 0.7.

Resolution rate is the primary metric: the percentage of reported bugs that the author actually fixed in the final merged code. BugBot uses AI to classify automatically, at merge time, whether each reported bug was resolved.
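
A minimal sketch of that classification step, assuming a hypothetical `call_llm` client (this is not BugBot's actual implementation):

```python
# Sketch only: classify at merge time whether a reported bug was fixed.
# `call_llm` is a hypothetical stand-in for whatever model client you use.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with your provider's SDK."""
    raise NotImplementedError

def was_resolved(bug_report: str, final_diff: str) -> bool:
    """Return True if the reported bug appears fixed in the merged code."""
    prompt = (
        "A code review bot reported this bug:\n"
        f"{bug_report}\n\n"
        "Here is the final diff that was merged:\n"
        f"{final_diff}\n\n"
        "Was the reported bug addressed? Answer strictly YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

def resolution_rate(outcomes: list[bool]) -> float:
    """Share of reported bugs classified as fixed at merge time."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```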

Evolution: Static Pipeline to Autonomous Agent

Static Pipeline (V1)

A fixed sequence of steps with pre-defined context (a sketch follows the list):

  • 8 parallel passes with randomized diff order
  • Combining similar bugs into buckets
  • Majority voting to filter false positives
  • Merging each bucket into a single description
  • Filtering unwanted categories
  • Final validator to catch false positives
  • Deduplication against previous runs
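
A rough sketch of this V1-style pipeline, with a hypothetical `review_pass` LLM call and a deliberately naive similarity key standing in for real bucketing:

```python
# Sketch of a V1-style static pipeline: parallel passes with shuffled diff
# order, bucketing of similar findings, vote-based filtering, and dedup.
import random
from collections import defaultdict

N_PASSES = 8    # parallel passes with randomized diff order
MIN_VOTES = 3   # voting threshold to filter false positives (illustrative)

def review_pass(diff_hunks: list[str]) -> list[str]:
    """Hypothetical: one LLM pass over the diff, returning bug descriptions."""
    raise NotImplementedError

def bucket_key(bug: str) -> str:
    """Naive similarity key; a real system would use embeddings or clustering."""
    return " ".join(sorted(bug.lower().split()))[:120]

def run_pipeline(diff_hunks: list[str], previously_reported: set[str]) -> list[str]:
    buckets: dict[str, list[str]] = defaultdict(list)
    for _ in range(N_PASSES):
        shuffled = random.sample(diff_hunks, k=len(diff_hunks))  # randomized order
        for bug in review_pass(shuffled):
            buckets[bucket_key(bug)].append(bug)

    reports = []
    for key, bugs in buckets.items():
        if len(bugs) < MIN_VOTES:           # weak signals are dropped
            continue
        if key in previously_reported:      # dedupe against earlier runs
            continue
        reports.append(max(bugs, key=len))  # merge each bucket into one description
    return reports
```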

Agentic Architecture (V11)

The agent decides where to investigate more deeply (a sketch follows the list):

  • Reasoning about the diff with tool-calling capability
  • Autonomous decision on investigation depth
  • Dynamic context: fetches information as needed
  • Aggressive prompts encouraging complete investigation
  • Rich experimentation surface via toolset
  • Model pulls additional context at runtime
  • Tool design adjustments impact results

Dynamic Context: Less Is More

Providing fewer details initially allows the agent to pull relevant context on its own. This approach reduced total agent tokens by 46.9% in A/B tests, while simultaneously improving response quality.

Files as Interface: Long outputs become files the agent can read selectively.

Dynamic context techniques:

  • Tool outputs converted to files
  • Chat history as reference material
  • Skills with minimal descriptions + dynamic lookup
  • MCP tool descriptions synced to folders
  • Terminal sessions integrated with filesystem

Benefits: Token efficiency + response quality. Less contradictory information results in better reasoning.
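
A small sketch of the "files as interface" idea, with illustrative paths and thresholds: long tool output is persisted to disk, and the agent receives a pointer plus a preview it can expand on demand.

```python
# Sketch: spill long tool outputs to files instead of flooding the context.
import os
import tempfile
from pathlib import Path

MAX_INLINE_CHARS = 2_000  # illustrative threshold: anything longer goes to a file

def to_context(tool_name: str, output: str) -> str:
    """Return either the full output or a short pointer to a saved file."""
    if len(output) <= MAX_INLINE_CHARS:
        return output
    fd, raw_path = tempfile.mkstemp(prefix=f"{tool_name}_", suffix=".txt")
    os.close(fd)
    Path(raw_path).write_text(output, encoding="utf-8")
    # The agent sees a pointer plus a preview and can read more if it needs to.
    return f"[{tool_name}: {len(output)} chars saved to {raw_path}]\n{output[:500]}"

def read_slice(path: str, start_line: int, end_line: int) -> str:
    """Tool the agent can call to read only the lines it cares about."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start_line - 1:end_line])
```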

Deterministic Tools

Tool-augmented agents delegate specific tasks to static analysis, reducing tokens and hallucinations.

The hybrid architecture runs deterministic checks first, then uses the LLM only for semantic reasoning that tools cannot do.

Hybrid Pipeline:

  1. Linters and SAST run first (deterministic)
  2. AST parsing structures code semantically
  3. LLM receives results + diff (fewer tokens)
  4. Agent focuses on logical and contextual bugs

Tools agents use:

  • Linters: ESLint, Ruff, golangci-lint
  • Type Checkers: Mypy, TypeScript, fbinfer
  • AST Parsers: Tree-sitter, ast-grep, OXC
  • SAST: Semgrep, CodeQL, Checkmarx
  • MCP Servers: Expose tools via protocol

Why it works: Deterministic tools provide “ground truth” for critical operations. The LLM does not need to spend tokens detecting syntax or type errors — it focuses on what truly matters.
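
A sketch of the hybrid flow using Ruff as the deterministic step and a hypothetical `call_llm` client; exact CLI flags may vary by tool version.

```python
# Sketch of a hybrid pipeline: deterministic tools run first, and the LLM
# only sees their structured findings plus the diff.
import json
import subprocess

def run_ruff(paths: list[str]) -> list[dict]:
    """Deterministic lint pass; returns structured findings."""
    proc = subprocess.run(
        ["ruff", "check", "--output-format", "json", *paths],
        capture_output=True, text=True,
    )
    return json.loads(proc.stdout or "[]")

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model client

def hybrid_review(diff: str, changed_paths: list[str]) -> str:
    lint_findings = run_ruff(changed_paths)  # step 1: deterministic ground truth
    prompt = (
        "Deterministic tools already found these issues (do not repeat them):\n"
        f"{json.dumps(lint_findings, indent=2)}\n\n"
        "Focus only on logical, semantic, and contextual bugs in this diff:\n"
        f"{diff}"
    )
    return call_llm(prompt)                  # step 4: LLM focuses on what tools cannot do
```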

Infrastructure for Scale

Robust Git Integration

BugBot rebuilt its Git integration in Rust for speed and reliability. The guiding principles: minimize fetched data and use efficient caching.

Rate Limiting and Batching

BugBot monitors rate limits and batches requests to operate within GitHub's API constraints, as sketched below.
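
A minimal sketch of that pattern against GitHub's REST API, using the standard X-RateLimit-* response headers and an illustrative safety margin; it is not BugBot's implementation.

```python
# Sketch: back off when close to the rate limit and fetch work in batches.
import time
import requests

MIN_REMAINING = 50  # illustrative safety margin before backing off

def github_get(session: requests.Session, url: str) -> requests.Response:
    resp = session.get(url)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", MIN_REMAINING + 1))
    reset_at = int(resp.headers.get("X-RateLimit-Reset", 0))
    if remaining <= MIN_REMAINING:
        time.sleep(max(0, reset_at - time.time()))  # wait for the window to reset
    return resp

def fetch_in_batches(session: requests.Session, urls: list[str], batch_size: int = 20):
    """Yield responses in batches instead of firing every request at once."""
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        yield [github_get(session, u) for u in batch]
```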

Customizable Rules

BugBot Rules allow encoding codebase-specific invariants without hardcoding them in the system (a loading sketch follows the list):

  • Unsafe migrations
  • Incorrect internal API usage
  • Project conventions
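
A sketch of injecting such rules into the review prompt. The rules file is assumed here to be a BUGBOT.md markdown file at the repository root; check Cursor's documentation for the exact name and location your setup expects.

```python
# Sketch: load team-specific review rules from the repo and prepend them
# to the review prompt instead of hardcoding them in the system.
from pathlib import Path

def load_rules(repo_root: str) -> str:
    rules_file = Path(repo_root) / "BUGBOT.md"  # assumed convention
    return rules_file.read_text(encoding="utf-8") if rules_file.exists() else ""

def build_review_prompt(diff: str, repo_root: str) -> str:
    rules = load_rules(repo_root)
    preamble = f"Project-specific review rules:\n{rules}\n\n" if rules else ""
    return preamble + f"Review this diff for bugs:\n{diff}"
```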

Metrics for AI-Powered DevEx

Modern DevEx requires metrics beyond speed:

Flow: Can developers achieve uninterrupted deep work?

Clarity: Do they understand code and context quickly?

Quality: Does the system resist drift and degradation?

Energy: Are work patterns sustainable?

Governance: Does AI behave predictably and traceably?

Code Review Metrics

  • Resolution Rate: % of reported bugs that were fixed. BugBot’s primary metric.
  • Inspection Rate: LOC / Review Hours. Benchmark: 150-500 LOC/hour.
  • Change Failure Rate: DORA metric. Canary for quality problems.
  • Time to First Review: Recommended target: < 24h. Directly impacts flow.
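
A sketch of computing these metrics from basic PR data; the field names are illustrative, not any particular tool's schema.

```python
# Sketch: compute inspection rate, change failure rate, and time to first
# review from minimal per-PR records.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PullRequest:
    lines_changed: int
    review_hours: float
    opened_at: datetime
    first_review_at: datetime
    caused_incident: bool  # input for change failure rate

def inspection_rate(prs: list[PullRequest]) -> float:
    """LOC reviewed per review hour (benchmark: roughly 150-500)."""
    hours = sum(p.review_hours for p in prs)
    return sum(p.lines_changed for p in prs) / hours if hours else 0.0

def change_failure_rate(prs: list[PullRequest]) -> float:
    """Share of changes that led to an incident (DORA metric)."""
    return sum(p.caused_incident for p in prs) / len(prs) if prs else 0.0

def time_to_first_review_hours(pr: PullRequest) -> float:
    """Target: under 24 hours."""
    return (pr.first_review_at - pr.opened_at).total_seconds() / 3600
```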

The Perception Paradox

METR Study (2025): Experienced developers using AI on their own open-source repositories showed surprising results.

  • Actual time with AI: +19% (slower)
  • Prior expectation: -24% (thought it would be faster)

Even after experiencing the delay, developers still believed AI accelerated them by 20%.

Implications for Leaders:

  • Do not trust perceptions — measure objectively
  • 60% of leaders cite the lack of clear metrics as their biggest challenge
  • Baseline before adopting: Cycle time, quality, satisfaction
  • Compare 3-6 months later: Real data vs. expectations

Treat DevEx as a systems design problem, not a cultural initiative. Define concrete metrics before scaling AI tools.

The False Positives Problem

AI code review tools typically operate with a 5-15% false positive rate. But the credibility cost is high.

Why tools fail:

  • Reading diffs without project context
  • Syntax-based checks, not intent
  • No awareness of internal conventions
  • Static checks on dynamic behavior
  • Hallucinations from generalist LLMs

Result: Up to 40% of alerts are ignored. Automation generates noise instead of actionable insights.

Mitigation Strategies

Majority Voting: Multiple parallel passes. A real bug shows up repeatedly, producing a stronger signal.

Feedback Loop: Developers mark false positives. The system learns from them (sketched below).

Severity Calibration: Start restrictive; loosen or disable rules that generate too much noise.
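
A minimal sketch of the feedback loop, with a deliberately naive similarity key and an illustrative on-disk store; a real system might use embeddings and a database.

```python
# Sketch: store findings marked as false positives and suppress similar
# findings in future runs.
import json
from pathlib import Path

FP_STORE = Path("false_positives.json")  # illustrative location

def finding_key(rule: str, message: str) -> str:
    """Naive similarity key for grouping equivalent findings."""
    return f"{rule}:{' '.join(message.lower().split())[:80]}"

def mark_false_positive(rule: str, message: str) -> None:
    known = set(json.loads(FP_STORE.read_text())) if FP_STORE.exists() else set()
    known.add(finding_key(rule, message))
    FP_STORE.write_text(json.dumps(sorted(known)))

def suppress_known(findings: list[dict]) -> list[dict]:
    """Drop findings the team has already flagged as noise."""
    known = set(json.loads(FP_STORE.read_text())) if FP_STORE.exists() else set()
    return [f for f in findings if finding_key(f["rule"], f["message"]) not in known]
```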

Prompting Inversion

Traditional Approach

Restrict the model to minimize false positives:

“Be conservative. Only report bugs if you have high certainty. Avoid false alarms.”

Result: Model too cautious, misses real bugs.

Agentic Approach

Encourage aggressive investigation:

“Investigate every suspicious pattern. Err on the side of reporting. Use tools to verify hypotheses.”

Result: Agent explores more, uses tools to validate before reporting.

In agentic architecture, the ability to call tools and fetch additional context fundamentally changes prompting strategy. The model can investigate before concluding.
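
A small illustration of the inversion, reusing the prompts quoted above: the aggressive prompt is only selected when the agent actually has tools to verify its hypotheses.

```python
# Sketch: choose the prompting strategy based on whether tools are available.
CONSERVATIVE_PROMPT = (
    "Be conservative. Only report bugs if you have high certainty. "
    "Avoid false alarms."
)

AGGRESSIVE_PROMPT = (
    "Investigate every suspicious pattern. Err on the side of reporting. "
    "Use tools to verify hypotheses before including them."
)

def system_prompt(has_tools: bool) -> str:
    # Without tools, aggressive prompting just inflates false positives;
    # with tools, the agent can validate suspicions before reporting them.
    return AGGRESSIVE_PROMPT if has_tools else CONSERVATIVE_PROMPT
```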

Implementation Roadmap

Phase 1 - Baseline (4-6 weeks):

  • Measure current cycle time
  • Document code quality
  • Satisfaction survey
  • Map review bottlenecks

Phase 2 - Pilot (4-6 weeks):

  • Select pilot team
  • Configure initial rules
  • Calibrate sensitivity
  • Collect weekly feedback

Phase 3 - Iteration (3-6 months):

  • Analyze resolution rate
  • Adjust rules by feedback
  • Add custom rules
  • Compare with baseline

Phase 4 - Scale (ongoing):

  • Expand to other teams
  • Monitor DORA metrics
  • Integrate into onboarding
  • Document playbooks

Do not rush calibration — lost credibility is hard to recover.

The Future of AI Code Review

Code Execution: Agents running code to verify their own bug reports.

Autofix: Agent that not only finds but automatically fixes bugs.

Continuous Monitoring: Constant codebase scanning, not just on PRs.

BugBot today is multiple times better than at launch. In a few months, it will be significantly better again.


At Victorino Group, we implement governed AI systems for engineering teams that need quality without sacrificing speed. If you want to scale code review with AI while maintaining control, let’s talk.
