The Best Model Finishes 7% of Real Legal Tasks. Now What?

When Harvey launched its Legal Agent Benchmark, the interesting question was not whether frontier models could do legal work. It was how you would ever know. A benchmark of 1,200 tasks across 24 practice areas, scored by lawyers, is a measuring instrument. The first readings are now in, and they are humbling in a way that should reset how regulated teams talk about agent readiness.

Harvey reports that under a strict all-pass standard, where every rubric criterion on a task must be satisfied for the task to count, the best frontier model completes only 7.1% of real legal tasks end-to-end. Not 71%. Seven. The leaderboard, as Harvey publishes it: Claude Opus 4.7 at 7.1%, Sonnet 4.6 at 5.4%, Opus 4.6 at 4.2%, GPT-5.5 at 2.1%, Gemini 3.5 Flash at 0.8%.

Two caveats before the lessons. Harvey is a legal-AI vendor publishing a benchmark it built, so read the absolute numbers as vendor-sourced. And “all-pass” is deliberately unforgiving: a task that nails nine of ten criteria scores zero. That severity is the point. In law, a brief that is 90% correct is not 90% useful. It is a liability with good formatting.
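The difference between all-pass and partial-credit scoring is easy to see in a few lines. This is a minimal sketch, not Harvey's scoring code: the function names and the ten-criterion task are hypothetical, and the only assumption is that each rubric criterion resolves to pass or fail.

```python
from typing import Dict, List


def all_pass(criteria: Dict[str, bool]) -> bool:
    """A task counts only if it has criteria and every one of them passed."""
    return bool(criteria) and all(criteria.values())


def partial_credit(criteria: Dict[str, bool]) -> float:
    """Fraction of criteria satisfied, shown only for contrast."""
    return sum(criteria.values()) / len(criteria) if criteria else 0.0


def all_pass_rate(tasks: List[Dict[str, bool]]) -> float:
    """Share of tasks that satisfy every criterion."""
    return sum(all_pass(t) for t in tasks) / len(tasks) if tasks else 0.0


# Hypothetical task: nine of ten criteria pass.
near_miss = {f"criterion_{i}": True for i in range(1, 10)}
near_miss["criterion_10"] = False

print(partial_credit(near_miss))  # 0.9
print(all_pass(near_miss))        # False
```

Under partial credit the near miss looks like a strong result. Under all-pass it contributes nothing to the rate, which is the behavior the benchmark is built around.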

A sub-10% ceiling is a strategy signal, not a verdict

The reflex when you see 7.1% is to conclude the models are not ready. That is the wrong read. The right read is that legal work is nowhere near saturated, and the distance between a demo and production-grade legal output is enormous and now measurable.

This matters because the market has been pricing legal AI as if the hard part were solved and the remaining work were integration. The all-pass ceiling says otherwise. If the best available model gets seven of a hundred real tasks fully right on its own, then the value of a legal AI product is not the model. It is everything wrapped around the model: retrieval, validation, the workflow that catches the criteria the model missed before they reach a partner.

A 7% ceiling on the raw model is the strongest possible argument for the systems layer. The vendors who win will not be the ones with privileged model access. They will be the ones who turn a 7% model into a 70% workflow.

Jagged intelligence: no single model wins

The leaderboard ranks models head to head, which invites a tempting simplification: pick Opus 4.7, it scored highest, done. Harvey’s own framing resists this, and the resistance is the more useful finding.

Performance is jagged. A model that leads on litigation drafting can trail on tax structuring or regulatory analysis. The aggregate score hides per-practice-area inversions: the model you would route a securities question to is not the model you would route an employment matter to. Intelligence is not a single dimension you can rank. It is a surface with peaks and valleys that differ by domain.

The operational consequence is direct. A production legal agent cannot be single-model. It has to route. The right architecture treats the frontier models as a panel of specialists and sends each task to whichever model peaks on that practice area. This is not a hedge against any one vendor. It is the only way to harvest the best available performance across a jagged surface, because no single point on that surface is highest everywhere.
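A panel router can start as nothing more than a lookup. The sketch below assumes you have measured all-pass rates per practice area on your own tasks. The model names and scores are placeholders invented for illustration, not Harvey's per-area results, and `route` is a hypothetical helper.

```python
from typing import Dict

# Hypothetical per-practice-area all-pass rates from your own evaluation.
# Illustrative numbers only; they are not taken from Harvey's benchmark.
SCORES: Dict[str, Dict[str, float]] = {
    "litigation": {"model_a": 0.09, "model_b": 0.05},
    "tax": {"model_a": 0.03, "model_b": 0.07},
}


def route(practice_area: str,
          scores: Dict[str, Dict[str, float]] = SCORES,
          default: str = "model_a") -> str:
    """Return the model with the highest measured score for this practice area."""
    area_scores = scores.get(practice_area)
    if not area_scores:
        # No measurements for this area yet: fall back to a default model.
        return default
    return max(area_scores, key=area_scores.get)


print(route("litigation"))  # model_a
print(route("tax"))         # model_b
```

The inversion between the two rows is the whole point: a single standardized model would be the wrong choice for one of them.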

That has procurement implications most teams have not absorbed. If your legal AI strategy is “we standardized on one model,” you have locked yourself to that model’s valleys. The benchmark says the valleys are deep.

Cost and latency move in the wrong direction

Here is the detail that complicates the multi-model story. The top scorer is also the most expensive and the slowest. Harvey reports Opus 4.7 at roughly $50.90 per task and about 22 minutes of wall-clock time. The model that gets seven tasks right out of a hundred spends about fifty dollars and twenty-two minutes on each task it attempts.

For a partner billing at high rates, fifty dollars and twenty-two minutes for a usable first draft is trivially worth it. For an agent that runs thousands of tasks a day across a firm, the math inverts fast. The routing layer is not only choosing the model that scores highest on a practice area. It is choosing the model that clears the quality bar at acceptable cost and latency. Sometimes that is the cheaper model that scores a point lower. Sometimes the 22-minute, fifty-dollar answer is the only one that clears the bar, and you pay it.
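A rough back-of-envelope makes the inversion concrete. This sketch assumes the reported $50.90 is charged on every attempted task whether or not it passes, and that the model runs unaided with no workflow around it. The volume of 1,000 tasks a day is a hypothetical figure chosen for illustration.

```python
COST_PER_TASK_USD = 50.90  # Harvey's reported per-task cost for the top scorer
ALL_PASS_RATE = 0.071      # Harvey's reported all-pass rate for the same model
TASKS_PER_DAY = 1_000      # hypothetical firm-wide volume


def cost_per_passing_task(cost_per_task: float, pass_rate: float) -> float:
    """Expected spend per fully passing task if every attempt is paid for."""
    if pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return cost_per_task / pass_rate


def daily_spend(cost_per_task: float, tasks_per_day: int) -> float:
    """Total daily spend at a given task volume."""
    return cost_per_task * tasks_per_day


print(f"{cost_per_passing_task(COST_PER_TASK_USD, ALL_PASS_RATE):.2f}")  # 716.90
print(f"{daily_spend(COST_PER_TASK_USD, TASKS_PER_DAY):,.0f}")           # 50,900
```

On those assumptions, each fully passing task costs roughly $717 in raw model spend, and a thousand attempts a day runs to about $50,900.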

Routing, in other words, is a three-axis decision: accuracy, cost, latency. Treating it as accuracy-only is how legal AI budgets detonate in month three.
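One way to encode the three axes is to treat accuracy and latency as bars to clear and cost as the thing to minimize. That is a design choice, not something Harvey prescribes, and the candidate names and figures below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    pass_rate: float    # measured all-pass rate on this practice area
    cost_usd: float     # average cost per task
    latency_min: float  # average wall-clock minutes per task


def pick(candidates: List[Candidate],
         min_pass_rate: float,
         max_latency_min: float) -> Optional[Candidate]:
    """Cheapest candidate that clears the quality bar within the latency cap."""
    eligible = [
        c for c in candidates
        if c.pass_rate >= min_pass_rate and c.latency_min <= max_latency_min
    ]
    if not eligible:
        return None  # nothing clears the bar: escalate to a human
    # Lowest cost wins; ties go to the higher pass rate.
    return min(eligible, key=lambda c: (c.cost_usd, -c.pass_rate))


# Hypothetical candidates for one practice area.
panel = [
    Candidate("slow_strong", pass_rate=0.07, cost_usd=50.0, latency_min=22.0),
    Candidate("fast_cheap", pass_rate=0.06, cost_usd=8.0, latency_min=4.0),
]

print(pick(panel, min_pass_rate=0.05, max_latency_min=30.0).name)   # fast_cheap
print(pick(panel, min_pass_rate=0.065, max_latency_min=30.0).name)  # slow_strong
print(pick(panel, min_pass_rate=0.065, max_latency_min=10.0))       # None
```

The three calls cover the three outcomes in the text: the cheaper model that scores a point lower, the expensive answer that is the only one clearing the bar, and the case where nothing qualifies and the task goes to a person.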

The trajectory matters as much as the answer

The most important finding in Harvey’s results is not on the leaderboard. It is in the behavior.

Harvey scored not just what the agents produced but how they got there: the trajectory of reading, searching, drafting, validating, revising. And the trajectory predicts the score. Specific behaviors lift all-pass performance, and specific behaviors sink it.

The lifts: revising after a self-check adds about 1.5 all-pass points. Running a validation step after drafting adds about 0.8. The drops: drafting without any review costs about 1.2 points. Noisy tool use, defined as five or more tool calls in a single turn, costs about 0.5.

Read those numbers together and a discipline emerges. The agents that win do not produce more. They produce, then check, then revise. They search with intent rather than spraying tool calls and hoping. The behavioral profile of a good legal agent looks like the behavioral profile of a good junior associate: draft, verify against source, fix what verification surfaced, then hand it up.

This is the part that should reshape governance. If trajectory predicts quality, then governing legal agents means governing behavior, not just sampling outputs. An agent that produces a correct brief by drafting blind and getting lucky is not a governed agent. It is an ungoverned one that has not failed yet. The behavior is the control surface. The output is the lagging indicator.
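If behavior is the control surface, the first step is turning a trajectory into flags. The sketch below is one possible shape, not Harvey's instrumentation. The `Turn` record and its action labels are hypothetical, and the only number taken from the results is the threshold of five tool calls in a single turn.

```python
from dataclasses import dataclass
from typing import Dict, List

NOISY_TOOL_CALLS = 5  # five or more tool calls in one turn counts as noisy


@dataclass(frozen=True)
class Turn:
    # Hypothetical labels: "search", "draft", "validate", "self_check", "revise"
    action: str
    tool_calls: int = 0


def trajectory_flags(turns: List[Turn]) -> Dict[str, bool]:
    """Derive the four behavioral signals from an ordered list of turns."""
    actions = [t.action for t in turns]
    first_draft = actions.index("draft") if "draft" in actions else None
    after_draft = actions[first_draft + 1:] if first_draft is not None else []
    reviewed = any(a in ("validate", "self_check") for a in after_draft)
    # A revision counts only if a self-check happened somewhere before it.
    revised_after_check = any(
        a == "revise" and "self_check" in actions[:i]
        for i, a in enumerate(actions)
    )
    return {
        "validated_after_draft": "validate" in after_draft,
        "revised_after_check": revised_after_check,
        "draft_without_review": first_draft is not None and not reviewed,
        "noisy_tool_use": any(t.tool_calls >= NOISY_TOOL_CALLS for t in turns),
    }


disciplined = [Turn("search", 2), Turn("draft"), Turn("validate", 1),
               Turn("self_check"), Turn("revise")]
blind = [Turn("search", 6), Turn("draft")]

print(trajectory_flags(disciplined))
print(trajectory_flags(blind))
```

The disciplined run sets both positive flags and neither defect. The blind run sets both defects, and it would be flagged even if its draft happened to pass every criterion.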

Governance in law is governance-as-measurement

Put the three findings together and a posture falls out. The model alone clears 7% of tasks. No single model is best across practice areas. The behavior on the way to the answer predicts whether the answer holds. None of that is governable by reviewing final documents after the fact.

Governance in regulated legal work has to be governance-as-measurement: a procurement-grade, all-pass bar applied to behavior, not a spot-check applied to outputs. That means three commitments most legal-AI deployments do not yet make.

Hold an all-pass bar. A task is done when every criterion passes, not when most do. Ninety percent in law is a failure with good grammar.

Measure the trajectory. Instrument what the agent reads, searches, drafts, validates, and revises. Reward revise-after-check. Flag draft-without-review and noisy tool fan-out as defects, even when the final answer happens to be right.

Route across a panel. Treat models as specialists, score them per practice area on your own tasks, and let cost and latency into the routing decision. A model that scores highest on a leaderboard you did not build is not evidence about your matters.

Do this now

Take one legal workflow you are tempted to automate. Define its all-pass rubric: list every criterion a partner would require, and accept nothing less than all of them. Run your candidate model against ten real tasks under that bar and record the raw end-to-end pass rate. It will be lower than you expect, and that number is your honest starting line. Then instrument the trajectory on those same ten: does the agent validate after drafting, does it revise after checking, does it spray tool calls? You now have an honest baseline and the behavioral signals Harvey’s data says predict quality, along with a measurement posture that survives a regulator asking how you know the agent is safe. The benchmark’s lesson is not that legal AI is far off. It is that the teams who measure behavior, not scores, are the ones who will get to deploy it.

This analysis synthesizes “Initial Results on Legal Agent Benchmark” (Harvey, May 2026).

Victorino Group helps regulated teams measure agent behavior, not just benchmark scores.