A Home-Grown Agent Team Beat 10+ Security Scanners. The Architecture Is Why.

Ramp’s security engineer Eli Block reported a number that should reframe how every platform team thinks about vulnerability work. A small team built an agent system that found, validated, and fixed roughly 100 latent security issues across the codebase in six days, with no human involved until the pull request review. The system began as a four-hour hackathon project and was finished by one team member in under a week. The issues it caught had already survived penetration testing, a bug bounty program, static analysis, and more than ten commercial code-scanning vendors.

That last detail is the one worth sitting with. The scanners were not absent. They were running. They missed the issues anyway, and a home-grown agent team caught them. The reason is not a smarter model. The reason is context, and context is an architecture decision.

Why Context Beats Coverage

A commercial scanner is built to be generic. It has to work across every customer’s codebase, which means it cannot encode any single customer’s threat model, naming conventions, internal trust boundaries, or which endpoints actually face the internet. It pattern-matches against a library of known signatures and reports what fits. Breadth is its product. Specificity is its blind spot.

An agent team built inside the codebase inherits the opposite tradeoff. It can read the repo’s own conventions, the team’s priorities, the internal documentation that says which service is privileged and which is sandboxed. When it evaluates a candidate finding, it is not asking “does this match a CWE signature.” It is asking “is this exploitable given how this system is actually wired.” That second question is the one that separates a real issue from a theoretical one, and it is the question generic tooling structurally cannot ask.

This is the part that should change a budget conversation. The advantage was not the agents being clever. The advantage was the agents carrying the same context a senior engineer carries, applied at a volume no senior engineer has time for. The scanner sells coverage. The team built judgment.

The Labor Cost Collapse

Security teams have always known about more issues than they fix. The bottleneck was never detection. It was the labor chain after detection: validate the finding, confirm it is real and not noise, write a fix, write a test that proves the fix, confirm nothing broke. Each step costs human hours, and human hours are rationed. So teams triage to criticals, and the long tail of medium and low findings accumulates as latent risk that no one has time to touch.

Agents collapse that chain. When find, validate, fix, and confirm all run at machine cost, the economics of triage invert. You no longer have to ignore the broad finding set because acting on it is too expensive. You can act on the whole set. Ramp’s hundred issues were not all criticals; they were the backlog that conventional staffing would never have justified clearing. The agents made clearing it affordable.
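A toy cost model makes the inversion concrete. The sketch below is illustrative only: the backlog mix, the hours per finding, the 60-hour budget, and the 15-minute review cost are all made-up assumptions, not figures from Ramp. It compares how much of a backlog gets cleared when humans pay for the whole chain versus when they pay only for PR review.

```python
from dataclasses import dataclass, replace

# All numbers below are hypothetical, chosen only to show the shape of the tradeoff.
SEVERITY_ORDER = {"critical": 0, "medium": 1, "low": 2}

@dataclass(frozen=True)
class Finding:
    severity: str
    human_hours: float  # assumed human effort to take this finding to a merged fix

def cleared_within_budget(findings, hours_available):
    """Greedy triage: highest severity first, skipping anything that no longer fits."""
    cleared = []
    remaining = hours_available
    for finding in sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity]):
        if finding.human_hours <= remaining:
            cleared.append(finding)
            remaining -= finding.human_hours
    return cleared

backlog = (
    [Finding("critical", 8.0)] * 5
    + [Finding("medium", 6.0)] * 30
    + [Finding("low", 4.0)] * 65
)
budget = 60.0  # human hours available for security fixes

# Case 1: humans validate, fix, and test every finding themselves.
staffed = cleared_within_budget(backlog, budget)

# Case 2: agents run the chain; humans only review the PR (assumed 15 minutes each).
review_only = [replace(f, human_hours=0.25) for f in backlog]
assisted = cleared_within_budget(review_only, budget)

print(len(staffed), len(assisted))
```

Under these invented numbers, the human-only budget clears 8 of 100 findings (the five criticals and three mediums), while the review-only budget clears all 100. The exact figures do not matter; what matters is that the long tail becomes affordable once human hours are spent only at the gate.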

That is the strategic shift hiding inside the headline. Cheaper labor for the full lifecycle means a security team can finally operate on the distribution it always saw, instead of the thin slice it could staff.

The Architecture Is the Lesson

The autonomy here is not “let a model loose on the codebase.” It is a governed pipeline with specialized roles and an adversarial check built into the middle. The shape, as Ramp described it, runs like this.

A coordinator routes the work. Detector agents scan for candidate issues, each tuned to a class of problem. Then comes the load-bearing piece: manager agents act as adversarial judges of the detectors’ proposals. Their job is to reject, not to approve. Validator agents write tests that attempt to prove the issue is real. A fixer agent works in a test-driven loop, writing the patch against the failing test until it passes. Only then does a human see a pull request.
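The flow above can be sketched as a small coordinator function. This is a minimal sketch under stated assumptions, not Ramp's implementation: every name here (Candidate, coordinate, the stub roles) is hypothetical, each role is reduced to a plain callable standing in for an agent, and the handling of unproven findings and failed patches is my own guess at sensible behavior.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    title: str
    audit_trail: list = field(default_factory=list)  # one entry per stage reached

def coordinate(repo, detectors, manager_rejects, write_test, write_patch):
    """Route each candidate through detect -> challenge -> validate -> fix.

    Returns (pull_requests, audited): the test-backed fixes a human should
    review, plus every candidate seen, each with its audit trail.
    """
    pull_requests, audited = [], []
    for detect in detectors:
        for cand in detect(repo):
            audited.append(cand)
            cand.audit_trail.append("detected")
            if manager_rejects(cand):          # adversarial gate: default job is to reject
                cand.audit_trail.append("rejected by manager")
                continue
            test = write_test(cand)            # None means the issue could not be proven
            if test is None:
                cand.audit_trail.append("not reproduced")
                continue
            cand.audit_trail.append("validated")
            patch = write_patch(cand, test)    # None means no patch made the test pass
            if patch is None:
                cand.audit_trail.append("no passing patch")
                continue
            cand.audit_trail.append("fixed")
            pull_requests.append((cand, test, patch))
    return pull_requests, audited

# Hypothetical stubs standing in for real agents.
def stub_detector(repo):
    return [Candidate("unsanitized redirect"), Candidate("debug flag in test fixture")]

def stub_manager_rejects(cand):
    return "test fixture" in cand.title

def stub_write_test(cand):
    return "test_" + cand.title.replace(" ", "_")

def stub_write_patch(cand, test):
    return "patch for " + test

prs, audited = coordinate(
    "path/to/repo", [stub_detector], stub_manager_rejects, stub_write_test, stub_write_patch
)
```

With these stubs, one candidate reaches the human as a pull request and the other stops at the manager, with its trail recording where and why. Keeping the trail on every candidate, including the rejected ones, is what lets a reviewer later check whether the rejections were justified.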

The number that proves the design works is the rejection rate. The adversarial manager agents threw out 40% of what the detectors proposed, and human review confirmed every rejection was a genuine false positive. That is the whole game. A detector optimized for recall will over-report; that is what detectors do. Putting an agent whose entire incentive is to reject in front of the human gate means the false positives die at machine cost, before they ever consume a reviewer’s attention or erode trust in the system.
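The arithmetic is simple but worth writing down. The sketch below assumes a hypothetical batch of 200 detector proposals; only the 40% rejection rate comes from Ramp's account, and the function name and batch size are mine.

```python
def split_by_rejection(proposed: int, rejection_percent: int):
    """Return (forwarded, rejected) counts using integer arithmetic."""
    rejected = proposed * rejection_percent // 100
    return proposed - rejected, rejected

# 200 is an invented batch size; 40 is the rejection rate reported in the source.
forwarded, rejected = split_by_rejection(200, 40)
print(forwarded, rejected)  # prints "120 80"
```

In this hypothetical batch, 80 proposals never reach a later stage or a human. Because human review confirmed the rejections were genuine false positives, that reduction came at no cost in missed issues.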

Notice what each role contributes to governance. The coordinator gives you a single point of orchestration to audit. The detector-manager split separates “find everything plausible” from “keep only what survives challenge,” so neither role is asked to do both jobs at once. The validator turns a claim into a reproducible test, which is the only honest proof a vulnerability exists. The TDD fixer guarantees the patch is tied to that test rather than to a vibe. And the human PR gate sits at the end, reviewing a small, pre-filtered, test-backed set instead of a flood of raw findings.

This is the same pattern we have argued for in our writing on the agent security architecture problem: autonomy is safe when it is structured, and dangerous when it is monolithic. It also extends the point from our piece on AI security research and its governance implications, where the open question was “AI can find vulnerabilities, now what.” Ramp answers the “now what” with org design. The answer is a pipeline where the human gate is at the end, the adversarial gate is in the middle, and every step produces an artifact you can audit.

Where Teams Get This Wrong

The tempting misread is “buy fewer scanners, build agents instead.” That is not the lesson. The lesson is that detection without a validation-and-rejection pipeline produces noise, and noise is what kills security automation. Most teams that wire up an LLM to scan code skip the adversarial manager and the test-writing validator, then drown their reviewers in plausible-looking false positives. Within a month the reviewers stop trusting the output, and the system is shelved.

The 40% rejection rate is not a footnote. It is the feature. A pipeline that cannot reject itself is a pipeline that exports its noise to humans. Build the rejecter before you scale the detector.

Do This Now

Map your current vulnerability flow against the five roles. Where does detection happen? Who validates that a finding is real? Who writes the proof? Who writes the fix? And where does the human actually look? If detection feeds straight to a human queue with no adversarial filter in between, you have a recall machine pointed at your reviewers, and it will burn their trust before it earns it.
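One way to run that mapping exercise is to write the flow down as data and check it for gaps. This is a rough sketch, assuming you can name an owner (a tool, an agent, or a team) for each stage; the stage names, the example owners, and both helper functions are hypothetical.

```python
# Stage names are my own labels for the five questions above.
STAGES = ["detect", "adversarial_filter", "proof_test", "fix", "human_review"]

def find_gaps(flow: dict) -> list:
    """Return the stages that have no owner in the current flow."""
    return [stage for stage in STAGES if not flow.get(stage)]

def is_recall_machine(flow: dict) -> bool:
    """True when detections reach humans with no adversarial filter in between."""
    return bool(
        flow.get("detect")
        and flow.get("human_review")
        and not flow.get("adversarial_filter")
    )

# Example flow for an imaginary team; owners are placeholders.
current_flow = {
    "detect": "commercial scanners",
    "adversarial_filter": None,
    "proof_test": None,
    "fix": "on-call engineer",
    "human_review": "security triage queue",
}

print(find_gaps(current_flow))          # prints "['adversarial_filter', 'proof_test']"
print(is_recall_machine(current_flow))  # prints "True"
```

The imaginary team here has detection and a human queue but nothing in between, which is exactly the shape that floods reviewers.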

Then ask the budget question honestly. You are likely paying for breadth (multiple scanners) while starving the part that converts findings into safe, validated fixes (the labor chain). Ramp’s result suggests the leverage is in the pipeline, not the detectors. Build the adversarial manager and the TDD validator first. Put the human at the PR gate, reviewing a filtered, test-backed set. Then let detection run as wide as you want, because the rejecter downstream makes breadth cheap instead of dangerous.

This analysis synthesizes “We Proactively Fixed ~100 Security Issues in 6 Days With 0 Humans” (Ramp, February 2026).

Victorino Group helps teams design governed agent architectures with humans at the right gate.