Your Agent Permission Model Works 40% of the Time

If your agent’s governance story is “we put the rule in the system prompt,” you have just described a 40% control.

That is not rhetoric. It is the headline number from ManyIH-Bench, a benchmark published on April 14, 2026 by Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, and Daniel Khashabi at Johns Hopkins CLSP. The authors stacked up to 12 privilege tiers on 853 agentic tasks drawn from 46 real-world agents. They measured how often frontier models honor the highest-privilege instruction when instructions conflict.

The results:

Gemini 3.1 Pro: 42.7%
Kimi K2.5: 42.4%
Qwen 3.5-397B: 41.0%
GPT-5.4: 39.5%
Claude Sonnet 4.6: 39.1%

The same models score above 99% on standard two-tier instruction-hierarchy evaluations. Add realistic depth and the scoreboard collapses.

The Four-Tier Lie

Every major agent SDK inherits a convenient approximation. OpenAI’s 2024 Instruction Hierarchy paper formalized it: system outranks developer outranks user outranks tool. The Model Spec extended the idea into a five-role Chain of Command. IHEval tested it across four roles and found the best open-source model resolving conflicts at 48%.

That was always a chat-app abstraction.

A production agent does not live in a four-tier world. It receives instructions from system prompts, developer prompts, user turns, retrieved documents, tool outputs, peer agents, memory, session state, policy engines, compliance rules, and downstream callback arguments. Not four sources. Eight to twelve, routinely.

ManyIH-Bench is the first benchmark that tests what happens when the stack is honest about its depth. And the ceiling it finds is the ceiling your agent is running on right now.

What The 40% Actually Means

Read the critique carefully and the number becomes more useful, not less.

The 12 tiers in ManyIH-Bench are synthetic ordinal depth, not named roles. Real production agents deal with partial orderings, dynamic authority, and conflicts that look more like trade-offs than contradictions. That is why this is a stress test, not a production-trace simulation.

The 40% is also not a hard mathematical ceiling. Switching the priority wrapper from ordinal ([[Privilege N]], lower wins) to scalar ([[z=N]], higher wins) costs GPT-5.4 and Claude Opus 4.6 eight percentage points. That sensitivity means the models are not reasoning about priority. They are pattern-matching on a format. Fine-tuning or a dedicated policy engine could almost certainly lift the number.

What the 40% is: the structural performance of the in-model mechanism for instruction prioritization under realistic fan-in. Chain-of-thought does not rescue it. Qwen 3.5-397B burns roughly seven thousand reasoning tokens and still loses to GPT-5.4 at one thousand. Coding correctness on the same benchmark exceeds 86%, while style-priority compliance stays below 67%. The agents can do the task. They cannot reliably choose which conflicting directive to honor.

This is the quantitative confirmation of a qualitative argument we have made before: as we wrote in Boundaries Beat Instructions, trust has to live outside the model. ManyIH-Bench gives that argument a number.

The Coincidence That Isn’t A Coincidence

Forty percent keeps appearing. We wrote about it in Why Your AI Fails 40% of the Time in a different context: the accuracy wall on complex agentic tasks. Now it shows up again in priority resolution.

It is tempting to call this coincidence. It is not.

Different surfaces are hitting the same wall because the underlying capability is the same: reliable reasoning under uncertainty when multiple signals compete. Whether you measure that as task completion, hallucination rate, or priority resolution, frontier models plateau at a level that makes the behavior interesting to demo and unsafe to govern.

Why You Cannot Layer Your Way Out

The engineering instinct is to add more tiers. More roles. Richer taxonomies. A detailed Model Spec with explicit arbitration rules.

ManyIH-Bench answers that instinct directly: accuracy falls monotonically as tiers increase. Every model shows drops between 6.8% and 24.1% as the hierarchy moves from six to eight to twelve tiers. Adding structure does not help. It hurts.

Architectural simplification — fewer decision points, clearer trust boundaries, hard isolation between privilege domains — beats richer taxonomies. This is also why the four containment patterns for agent sandboxing matter more than any system-prompt update. A sandbox does not need the model to honor a priority rule. It enforces the rule whether the model agrees or not.

The Benchmark Is Contamination-Resistant

The usual dismissal of “frontier models fail at X” claims is that the benchmark leaked into training data. That dismissal does not apply here.

ManyIH-Bench generates its conflict pairs per sample, at benchmark construction time, using Claude Sonnet and Opus 4.6. Even if MBPP and AgentIF are in training data (they are), the specific priority-tagged conflict sets are not. The evaluation is not retrieval. It is runtime compliance — does the output satisfy the highest-privilege constraint while producing a functionally correct artifact. Human verification of 100 random samples returned 81 faithful, 11 unclear, 8 incorrect, setting a known noise floor.

This matters because it is the first time we can point to the priority-reasoning failure without waving our hands about leaked test sets. As we argued in Your LLM Benchmark Score Is A Scaffold Artifact, benchmarks without this kind of discipline tell you about training surface area, not capability. This one tells you about capability.

The Gap Your SDK Won’t Tell You About

We have also written about the governance gap in the OpenAI Agents SDK: the SDK learned to run before it learned to be governed. ManyIH-Bench sharpens that critique. Every major agent framework — OpenAI’s SDK, LangChain, AutoGen, Anthropic’s tool-use loop — inherits the four-tier assumption wholesale. None of them run on a 12-tier reality, and none of them ship with tests for what happens when the reality shows up in production.

The 40% is the base rate of the scaffold they hand you.

The Question To Take To Your Vendor

If you are a CISO, a Head of AI Governance, or the executive sponsor of an agent program, there is one question worth asking your AI platform vendor this week:

“Show me your benchmark on priority resolution at eight or more instruction sources, and show me the mechanism that enforces priority outside the model when the model fails.”

If the answer is the system prompt, you are looking at a 40% control. If the answer is “we have guardrails,” ask which of those guardrails are enforced by a process outside the LLM’s generation step. If the answer is “the model is smart enough to figure it out,” end the meeting.

Governance cannot live where priority resolution is a token-matching task. It has to live in sandboxes, capability tokens, policy engines, and formal-verification gates — in structures that do not require the model to do the right thing, because the model, measurably, will not.

That is what 40% actually buys you: the honesty to stop outsourcing governance to a system prompt and start building it at the system layer.

This analysis synthesizes Many-Tier Instruction Hierarchy in LLM Agents (April 2026), The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (April 2024), IHEval: Evaluating Language Models on Following the Instruction Hierarchy (February 2025), Control Illusion: The Failure of Instruction Hierarchies in LLMs (February 2025), and AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses (June 2024).

Victorino Group helps teams design agent governance that doesn’t bet on a 40% base. Let’s talk.