Turning the Reasoning Dial to Max Made the AI Worse

2,080 model-and-effort combinations ran against two real security vulnerability cases, and 70.8% of them found some piece of the bug. Only 1.9% produced a complete solution. That spread is the first warning. The second is sharper: in this experiment, turning the reasoning dial higher did not reliably make the answers better. Parsia, a security engineer at Microsoft, ran the test and published the numbers. One line summarizes the result better than any chart: “it looks like higher reasoning effort (and even later models) are not always better for triaging security results.”

The instinct in most teams is the opposite. When an AI answer looks shaky, push the model harder. Raise the reasoning effort, pick the newer release, give it more room to think. This is one well-documented experiment, not an industry benchmark, but it puts a number on a thing most teams assume away: the dial does not always turn the way you expect.

The dial is non-monotonic

The clearest single data point is a head-to-head between two effort levels of the same model family. gpt-5.5 at medium effort scored 0.360. The same model at extra-high effort scored 0.327. More reasoning, worse result, same task. Not a rounding wobble, a reversal.

A monotonic dial would mean every increase in effort buys at least a little more quality. This one bends. Past a peak, additional reasoning on this task started costing accuracy rather than adding it. The model, given more room to think, talked itself into worse answers on some cases. Anyone who has watched a capable colleague overthink a simple call recognizes the shape.

The operational consequence is direct. If you have standardized on “max effort, latest model” as your safety setting, you may be paying more for a configuration that scores lower on the work you actually care about. The only way to know is to measure your task at multiple effort levels, not to assume the top of the dial is the safe end.

Scope moved the needle more than effort

The largest swing in the entire dataset did not come from reasoning effort at all. It came from how the problem was framed.

When the analysis was scoped to a single function, results beat whole-file analysis by up to 31.7% on one configuration (gpt-5.4 at extra-high effort). Same model, same effort, same vulnerability. The only change was the size of the window the model had to reason over. Narrow it to the relevant function and accuracy climbed. Hand it the whole file and the model had more to get lost in.

The case-level numbers make the point louder. On the openbsd-sack vulnerability, whole-file analysis landed at 1.7% across all models. On the freebsd-nfs case, whole-file analysis hit 90.8%. The framing of the task dominated the outcome far more than which model or which effort level was chosen. A bad scope sank every model. A good scope floated most of them.

This is the lever teams underuse. Reasoning effort is a knob on the model. Scope is a decision you own. You decide whether the agent sees one function or ten thousand lines. In this experiment, that decision was worth more than any setting on the model itself.

A council beats a single confident run

If one model at max effort is not trustworthy, swapping in a different single model does not help either. What helps is structure around the models.

Parsia built an LLM triage council: multiple models voting on each finding rather than one model deciding alone. The council reached 86.2% unanimity, with only 2.8% of cases producing no majority at all. The remainder fell in between, flagged as genuinely contested. That is the signal you want. When the council agrees, you have corroboration from independent runs. When it splits, you have a flag that says “a human should look here,” which is more honest than a single model’s confident wrong answer.

This connects to a pattern we have written about before, in the decoupling of output competence from verification. A model can produce fluent, confident, wrong output, and fluency is not evidence. A council does not fix any single model. It makes disagreement visible, and visible disagreement is a verification primitive. One model that is sure of itself gives you nothing to audit. Five models, three of which dissent, give you a map of exactly where the work is shaky.

Effort also raised the refusal rate

There is a cost to cranking effort that has nothing to do with accuracy. Refusals climbed with it.

At the high end, claude-4.7-1m at extra-high effort hit a 21% content-filtering refusal rate on this security-triage work. The task was legitimate vulnerability analysis, the kind a security engineer does every day. More reasoning effort correlated with more cases where the model declined to answer at all. Push the dial up and a fifth of your work can come back as a refusal rather than a result.

The spend tells its own story. The full experiment cost roughly $9,200, and Claude ran three to four times pricier than GPT through Copilot. So the high-effort configuration on the more expensive model bought, in places, both a higher refusal rate and a lower accuracy score. That is the opposite of the intuition that more is safer. On this task, more was sometimes slower, costlier, and less useful all at once.

This is the domain-specificity trap we flagged in the domain expertise tax. The right setting gets discovered per task, per model, per scope. Carrying a global default between tasks is where the error lives.

Do this now

Pick one AI task your team runs on autopilot at “max effort, latest model.” Security triage, code review, contract analysis, any task where you trust the top of the dial because it feels safe.

First, run it at three effort levels, not one. Medium, high, extra-high, on the same inputs, and score the outputs against a small set of cases where you already know the right answer. If medium ties or beats extra-high, you have been overpaying for a worse result. The reversal in this experiment was not exotic; it was a default model family at two adjacent settings.

Second, test two scopes. Run the same task once on the narrow, relevant slice and once on the whole document. The 31.7% swing in this data came from scope alone. If narrowing the window lifts your accuracy, scope control belongs in your pipeline before any model upgrade.

Third, stand up a two-or-three model council on your highest-stakes decisions. You do not need a research budget. Run the same prompt through the models you already pay for, compare the outputs, and route every disagreement to a human. The agreement rate is your confidence signal. The disagreements are your audit queue. That structure, not a bigger single run, is where trust comes from. The same lesson holds when agents game the verification itself: the answer is always structure outside the single model, never more faith inside it.

This analysis synthesizes Brain the Size of a Planet: Are LLMs Thonking Too Hard? (Parsia, June 2026).

Victorino Group helps teams build verification structure around AI output instead of trusting a single model run. Let’s talk.