Guardrails Are an Uneven Procurement Variable

Kasra Rahjerdi spent $1,500 to answer a narrow question: can a large language model hack his own app? He pointed ten model configurations at a deliberately vulnerable application, capped each at $10 and two hours, and counted how many reached the flag. The headline number (which model won) is the least interesting result. The interesting result is how differently the models behaved before they even tried.

One model solved seven of ten runs. One refused on contact, burning 9,000 tokens to say no while others spent 100,000-plus actually working. The same prompt, the same target, the same task framing. The variance was not in skill. It was in willingness.

That variance has a name most procurement teams have never written into a contract: the guardrail surface. And it is not uniform.

What the Experiment Actually Measured

The target was an app Rahjerdi built with a hardened API in front and a wide-open Firebase behind it. Broken object-level authorization. The kind of flaw a real attacker loves and a real audit misses. He gave each model the same brief and let it run.

The solve rates, with the author’s own caveat that this is “not scientific” (ten runs per model, wide confidence intervals):

gpt-5.5: 7/10, roughly $9.46 per solve
deepseek-v4-pro: 3/10, about $0.19 per run
claude-sonnet-4.6: 2/10
claude-opus-4-8: 2/10, with late refusals after getting close
gemini-3.1-pro: 0/10, immediate refusal, 9,000 tokens against the 100,000-plus the working models spent

Read those numbers as directional, not definitive. The sample is small and the author says so plainly. But the shape of the result is too clean to ignore: the models did not fail at the same point or for the same reason. Some failed at the door. Some failed at the vault. One walked in.

Refusal Is a Behavior, Not a Floor

The common mental model treats safety as a baseline that every serious vendor clears. Buy any frontier model, the thinking goes, and you get roughly the same protective floor. The experiment dismantles that.

Gemini-3.1-pro refused immediately. It never engaged the task. Claude-opus-4-8 engaged, made real progress, then hit a late guardrail and stopped. GPT-5.5 worked the problem to completion seven times out of ten. Three different safety postures, three different points of intervention, one identical request.

Rahjerdi also noted a regional pattern: “Chinese models were way more comfortable attacking the DB.” The Western models hesitated specifically around live-database impact, the moment where action stops being analysis and starts being consequence. That hesitation is a design decision encoded by the vendor, and it differs by vendor.

This matters in both directions. If you want a model to run offensive security against your own systems, the refusing model is useless to you no matter how capable it is. If you want a model to refuse exactly that kind of work when an employee asks for it, the compliant model is a liability. The same guardrail is an asset or a defect depending entirely on your use case.

Most model selection runs on a familiar checklist: benchmark scores, context window, latency, price per token, data residency. Willingness to perform sensitive work almost never appears, because most buyers assume it is constant across vendors. It is not.

Consider the cost dimension alone. Gpt-5.5 solved at $9.46 a solve. Deepseek-v4-pro ran at $0.19 a run with a 3/10 solve rate. If your task is sensitive and you select on price, you may have just selected the model most willing to do the sensitive thing, at the lowest cost, with the least friction. That is a governance outcome you reached by accident, through a procurement decision that never named the variable it was actually setting.

The deeper problem is silence. A model that refuses tells you it refused. A model that quietly completes a borderline task tells you nothing. Across a fleet of agents acting on your behalf, the differences between “refused,” “completed with a warning,” and “completed silently” compound into very different risk profiles, and none of them show up on a benchmark leaderboard.

Guardrails Belong in the Matrix

Treat the guardrail surface as a measured procurement variable, the same way you treat latency or price. Three things have to change.

First, define the behavior you actually want, per task class. Offensive security tooling, customer-facing chat, internal code generation, and financial analysis do not share a refusal profile. The “right” guardrail for one is the wrong guardrail for another. Write down, per use case, whether you want the model to lean toward action or toward refusal.

Second, test for it. Build a small battery of borderline prompts that sit near your real risk boundaries and run every candidate model through them. You are not measuring capability here. You are measuring willingness, the point of refusal, and whether the refusal is loud or silent. Rahjerdi’s $1,500 bought a directional answer for one app. A focused internal eval costs far less and answers it for your actual workloads.

Third, re-test on every model version. Guardrail behavior shifts between releases, often without a changelog entry that names it. A model that refused in March may comply in June. Your procurement matrix needs a column that gets refreshed, not a one-time check that rots.

Do This Now

Pick the single most sensitive task your agents perform. Write five prompts that sit right at its risk boundary. Run them through every model you have in production and every model you are evaluating. Record three things for each: did it comply, where did it stop, and did it tell you. Put the result in your model-selection matrix next to price and latency, and assign an owner to re-run it on every version bump.

You are already selecting a guardrail posture every time you pick a model. The only question is whether you are doing it on purpose.

This analysis synthesizes I Spent $1,500 Seeing If LLMs Could Hack My App (Kasra Rahjerdi, June 2026).

Victorino Group helps enterprises make model guardrails a measured line in procurement. Let’s talk.