Cognition Now Scores Devin in Human-Equivalent Hours, and Bets $10M on It

Cognition published a method to answer one question for every Devin session: how long would a human engineer have taken to produce the same output? The unit is productive engineering hours, convertible to dollars at a standard global rate. Behind the number sits a guarantee of up to $10 million per enterprise customer. This is the most concrete attempt yet by an agent vendor to price its own output in terms a CFO recognizes, and the methodology is worth reading line by line, because the honesty of the math is the whole story.

The Unit Is Human Hours, Not Tokens

Token counts measure consumption. Lines of code measure volume. Neither tells a buyer what the work was worth. Cognition’s estimator skips both and answers the question a manager actually asks at review time: if a person had done this, how many productive hours would it have cost?

That reframing matters because it puts the agent on a scale humans already use for budgeting, staffing, and ROI. An engineering hour has a price. A token does not, at least not one that maps to value. By converting session output into human-equivalent hours, Cognition lets Devin’s work be added to the same ledger the rest of the team is measured on. The estimator is itself an agent, scoring each session after the fact rather than during it.

How the Calibration Actually Works

The estimator is fit against real human estimates rather than guessed. Cognition built a ground-truth dataset of 258 sessions from 126 users, asked humans to estimate how long each session’s output would have taken, and fit a curve in log-space:

h = 2.28 × m^0.923

where m is the model-side measure and h is the predicted human hours. The exponent below 1 makes the relationship slightly sublinear: the ratio h/m shrinks slowly as m grows. Because the exponent is so close to 1, the practical effect is close to a constant multiplier of roughly 2.08x. In this calibration, Devin’s raw output maps to about twice the human hours its surface metrics would naively suggest.
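To make the curve concrete, here is a minimal Python sketch of the two operations the paragraph describes: evaluating the published power law, and fitting one by least squares in log-space. The coefficients 2.28 and 0.923 come from the formula above. The function names, the sample values of m, and the choice of ordinary least squares are assumptions for illustration; the article does not define the unit of m or describe Cognition's fitting procedure beyond "log-space".

```python
import math

# Coefficients as quoted in the article: h = 2.28 * m^0.923
A, B = 2.28, 0.923


def predict_hours(m, a=A, b=B):
    """Predicted human hours for a model-side measure m (m > 0)."""
    return a * m ** b


def fit_power_law(ms, hs):
    """Ordinary least squares on log(h) = log(a) + b * log(m).

    Returns (a, b). This is one common way to fit a power law in
    log-space; it is an assumption, not Cognition's documented method.
    """
    xs = [math.log(m) for m in ms]
    ys = [math.log(h) for h in hs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Illustrative values of m only; the effective multiplier is h / m.
for m in (0.5, 1.0, 2.0, 4.0, 8.0):
    h = predict_hours(m)
    print(f"m={m}: h={h:.2f}, effective multiplier={h / m:.2f}")
```

At m = 1 the multiplier is exactly the leading coefficient, 2.28, and it drifts downward as m grows, which is the sublinearity the exponent encodes.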

The multiplier alone earns no trust. What earns it comes in the next figure.

The Held-Out Eval, and the Honesty in It

A model that fits its own training data proves little. Cognition held out a separate set of 233 sessions and scored the estimator against fresh human judgments. The result: a log-space correlation of 0.74 (r-squared of 0.54). About half of all sessions land within 2x of the estimate.

Read that the way a statistician would. An r-squared of 0.54 means the method explains a little over half the variance in human estimates, measured in log-space. Half the sessions miss the estimate by more than a factor of two. Cognition published both figures rather than rounding them into a marketing claim. That is the part worth copying. Most vendor productivity numbers arrive as a single confident multiple with no error bar, no held-out set, and no admission of where the method is weak.
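The same three figures can be computed for any estimator you are handed. The sketch below is a minimal Python version, assuming paired lists of predicted and human-estimated hours for held-out sessions. The function names and the sample numbers are hypothetical, and r-squared is taken here as the square of the log-space correlation, which matches the article's pairing of 0.74 with 0.54 up to rounding.

```python
import math


def log_pearson_r(predicted, actual):
    """Pearson correlation between log(predicted) and log(actual)."""
    xs = [math.log(p) for p in predicted]
    ys = [math.log(a) for a in actual]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_within_factor(predicted, actual, factor=2.0):
    """Share of sessions whose prediction is within `factor` of the human estimate."""
    hits = sum(
        1 for p, a in zip(predicted, actual)
        if 1.0 / factor <= p / a <= factor
    )
    return hits / len(predicted)


# Made-up held-out data in hours, for illustration only.
predicted_hours = [1.0, 2.0, 4.0, 8.0]
human_hours = [1.5, 1.2, 10.0, 7.0]

r = log_pearson_r(predicted_hours, human_hours)
print(f"log-space r = {r:.2f}, r-squared = {r * r:.2f}")
print(f"within 2x: {fraction_within_factor(predicted_hours, human_hours):.0%}")
```

Reporting the within-2x share next to the correlation matters because a correlation alone says nothing about how far an individual session's estimate can miss.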

Cognition also showed its work against the weak alternative. An estimator built on lines of code alone scored an r-squared of just 0.27, roughly half as predictive. The lesson is direct: volume metrics are poor proxies for value, and Cognition can prove it on its own data.

Where It Sits Against Prior Attempts

Cognition placed its result next to two earlier measurements, which is how honest methodology work is supposed to read. METR reported a log correlation of 0.83, but on 34 sessions from 7 staff, a small and controlled sample. Anthropic reported 0.46 across 1,000 Jira tickets using only titles and descriptions, a large sample with thin signal per item. Cognition’s 0.74 across 233 held-out sessions sits between them on both sample size and correlation: more data than METR, and richer signal per item than the Jira-title approach.

None of these is the final word. They are three points on a young curve, and the useful move is comparing them in the open rather than each vendor claiming its own number is definitive.

The Baseline Rules That Stop the Inflation

The easiest way to fake a productivity number is to define the baseline generously. Cognition wrote explicit rules to prevent that, and they are the most quietly important part of the post.

The estimator reasons about the path a human would have taken, not the detours the agent took. It credits only work the user did not already specify, so boilerplate the human dictated does not count as agent value. It assumes the human has the relevant expertise, removing the cheat of comparing the agent to a novice. It accounts for codebase familiarity, since an engineer who already knows the system works faster than a newcomer to it. Between 1 and 20 percent of sessions get filtered out as unproductive rather than counted at zero or, worse, as positive.

Each rule pushes the estimate down, toward a tougher human baseline. A vendor optimizing for a flattering headline would have written the opposite rules. These choices are what make the 2.08x multiplier credible instead of convenient.

The $10M Guarantee, and the Limit It Names

The methodology underwrites a commercial promise. Scott Wu, Cognition’s CEO, announced an AI Productivity Guarantee: for each enterprise customer, if Devin delivers less engineering value than the customer paid, Cognition funds usage up to $10 million until it does, assessed near the end of the annual contract. The estimator is the instrument that decides whether the bar was cleared.
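The arithmetic behind that bar can be sketched in a few lines of Python. This is one simple reading of the promise, not the contract: the article gives neither the hourly rate nor the exact mechanics of how funded usage is calculated, so the rate, the amounts, and both function names below are hypothetical. Only the $10 million cap comes from the announcement.

```python
GUARANTEE_CAP_USD = 10_000_000  # cap quoted in the article


def engineering_value_usd(human_hours, hourly_rate_usd):
    """Dollar value of estimated human-equivalent hours at a given rate."""
    return human_hours * hourly_rate_usd


def guarantee_shortfall_usd(human_hours, hourly_rate_usd, amount_paid_usd,
                            cap_usd=GUARANTEE_CAP_USD):
    """Gap between amount paid and estimated value, floored at 0 and capped.

    Assumption: the shortfall is treated as the amount of usage to be
    funded. The real terms may differ.
    """
    gap = amount_paid_usd - engineering_value_usd(human_hours, hourly_rate_usd)
    return min(max(gap, 0.0), cap_usd)


# Hypothetical customer: 10,000 estimated hours at a hypothetical $100/hour,
# against $1.5M paid over the contract year.
shortfall = guarantee_shortfall_usd(10_000, 100.0, 1_500_000)
print(f"estimated value: ${engineering_value_usd(10_000, 100.0):,.0f}")
print(f"shortfall to fund: ${shortfall:,.0f}")
```

Whatever the real terms are, the sketch shows why the estimator's error matters commercially: the hours figure feeds directly into whether the bar counts as cleared.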

Then comes the sentence most vendors would have cut. Cognition concedes the estimator “does not replace measuring ROI.” The method scores the agent against a human baseline. It tells you how many human hours Devin’s output represents. It does not tell you whether that output advanced revenue, reduced risk, or moved any business outcome the customer actually buys software to move. Engineering value is an input. Business ROI is the result, and Cognition states plainly that it has not solved the second.

That admission defines the boundary precisely. The method scores one player, the agent, against a synthetic human-of-equal-skill. It does not yet score the team that actually ships, the humans and the agent together, on the outcomes the business is paying for. Measuring the agent against a hypothetical solo human is a real advance. It is not the same as measuring whether the combined team produced something worth the budget.

Do This Now

If you are evaluating any AI productivity claim, hold it to the bar Cognition just set, and apply that bar to Cognition too. Demand four things before you believe a multiple. A held-out evaluation set the method never trained on. A reported correlation with an error band, not a single clean number. Explicit baseline rules that you can read and challenge, written to make the comparison harder rather than easier. And a stated boundary: what the number does measure, and what it admits it does not.

Then take the gap Cognition named and treat it as your own work. Engineering hours are an input you can now estimate. Business ROI is the output you still have to instrument yourself, on your own metrics, with your team and your agent on one scoreboard. A vendor measuring its agent against a hypothetical human is useful evidence. It is not a substitute for measuring whether your actual team, augmented by that agent, delivered value worth what you paid. Build that measurement before you sign, not after the renewal arrives.

The bar for an AI productivity claim just moved. Held-out eval, reported correlation, honest baselines, named limits. Anything less is a number without a method behind it.

This analysis synthesizes Measuring AI Productivity (Cognition, June 2026) and AI should earn its keep (Cognition, June 2026).

Victorino Group helps teams build measurement they can trust before they bet a budget on it.