Hooks Block, Evals Verify: The Deterministic Shell Around Probabilistic Agents

Two practitioners published in the same week, on opposite ends of the agent lifecycle, and described the same governance thesis without coordinating. Nader Dabit wrote about agent hooks: deterministic interception at six named lifecycle events, before and after every tool call, before and after every session. Cameron Wolfe, Staff Research Scientist at Netflix, published a long survey on agent evaluation centered on a metric called Pass^K, which measures consistency across all K independent attempts at the same task.

Hooks run before the action. Evals score after the action. Both refuse to trust the stochastic middle. Read together, they answer the same question from opposite directions: how do you build something deterministic around a model that, by definition, is not?

We have covered the containment stack as architecture, the vendor convergence that turned containment into a purchasable category, and the operational stack already shipping in production. The architectural picture is drawn. What this week added is the set of named primitives that practitioners will ship in code in 2026. Six events. One metric. That is the deterministic shell.

The Six Events That Bound an Agent Session

Dabit’s framing is mechanical and worth memorizing. Hooks fire at six lifecycle events, each one a place where deterministic policy can intercept what the model would otherwise do on its own:

SessionStart. Inject context, load policy, set environment variables before the first prompt is processed.
UserPromptSubmit. Validate or rewrite the prompt before it reaches the model.
PreToolUse. Block, modify, or approve a tool call before it executes.
PostToolUse. Inspect or act on tool output before it returns to the model.
Stop. Run completion gates before the agent declares the task done.
SessionEnd. Clean up, persist state, emit the audit log.

The pattern is always the same shape: event → matcher → handler → outcome. The matcher decides whether the hook applies to this specific call. The handler is deterministic code. The outcome is allow, block, or modify. No stochasticity inside the handler. That is the entire point.
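The event → matcher → handler → outcome shape can be sketched in a few lines. This is a minimal illustration, not Dabit's code or any framework's API: the `Hook` and `Outcome` types, the `dispatch` function, and the payload fields are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical types for illustration; not taken from any specific agent framework.
@dataclass(frozen=True)
class Outcome:
    action: str                     # "allow", "block", or "modify"
    reason: str = ""
    payload: Optional[dict] = None  # replacement payload when action == "modify"

@dataclass(frozen=True)
class Hook:
    event: str                          # e.g. "PreToolUse"
    matcher: Callable[[dict], bool]     # does this hook apply to this call?
    handler: Callable[[dict], Outcome]  # deterministic policy code

def dispatch(hooks: list[Hook], event: str, payload: dict) -> Outcome:
    """Run matching hooks in order. The first block wins; modifications chain."""
    current = payload
    modified = False
    for hook in hooks:
        if hook.event != event or not hook.matcher(current):
            continue
        outcome = hook.handler(current)
        if outcome.action == "block":
            return outcome
        if outcome.action == "modify" and outcome.payload is not None:
            current = outcome.payload
            modified = True
    if modified:
        return Outcome("modify", payload=current)
    return Outcome("allow")

# Usage: one hook that refuses every shell call in this session.
hooks = [
    Hook(
        event="PreToolUse",
        matcher=lambda p: p.get("tool") == "shell",
        handler=lambda p: Outcome("block", "shell disabled in this session"),
    ),
]
print(dispatch(hooks, "PreToolUse", {"tool": "shell", "command": "ls"}).action)
print(dispatch(hooks, "PreToolUse", {"tool": "read_file", "path": "a.txt"}).action)
```

The design choice worth noting is that nothing in `dispatch` consults the model. Given the same hooks and the same payload, it returns the same outcome every time.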

The examples Dabit gives are not theoretical. PreToolUse hooks that block edits to .env and .git. PreToolUse hooks that scan for rm -rf / or DROP TABLE before letting a shell or SQL call proceed. PostToolUse hooks that run the test suite after a file edit and roll back if it fails. Stop hooks that read a persisted .hook-state JSON file and refuse to declare completion until every required gate has fired. This is the same kind of policy enforcement an SRE writes for a deployment pipeline, except now the pipeline is an agent and the trigger is a tool call.
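A PreToolUse handler for the first two examples might look like the sketch below. The tool names (`edit_file`, `write_file`, `shell`, `sql`) and the payload fields (`tool`, `path`, `command`) are assumptions for illustration; a real agent runtime will define its own. The regular expressions catch only the literal patterns named above, so treat them as a floor, not a complete defense.

```python
import re
from pathlib import PurePosixPath

# Hypothetical tool names and payload fields; adapt to your agent runtime.
PROTECTED_NAMES = {".env", ".git"}
DESTRUCTIVE_PATTERNS = [
    # rm with both -r and -f flags (any order) aimed at the filesystem root
    re.compile(r"\brm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+\s+/(?:\s|$)"),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
]

def touches_protected_path(path: str) -> bool:
    """True if any path component is exactly .env or .git."""
    return any(part in PROTECTED_NAMES for part in PurePosixPath(path).parts)

def pre_tool_use(call: dict) -> tuple[str, str]:
    """Return (decision, reason) for a proposed tool call."""
    tool = call.get("tool")
    if tool in {"edit_file", "write_file"}:
        path = call.get("path", "")
        if touches_protected_path(path):
            return ("block", f"protected path: {path}")
    if tool in {"shell", "sql"}:
        text = call.get("command", "")
        for pattern in DESTRUCTIVE_PATTERNS:
            if pattern.search(text):
                return ("block", f"destructive pattern: {pattern.pattern}")
    return ("allow", "")

print(pre_tool_use({"tool": "shell", "command": "rm -rf /"})[0])
print(pre_tool_use({"tool": "shell", "command": "rm -rf /tmp/build"})[0])
print(pre_tool_use({"tool": "edit_file", "path": "config/.env"})[0])
```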

Why Hooks Beat “Better Prompts”

The temptation, when an agent does something dangerous, is to harden the system prompt. Add a paragraph about not deleting files. Add another paragraph about respecting working directories. Add a third paragraph reminding the agent to ask before destructive actions. After three or four iterations the system prompt is two thousand tokens of negative instruction, and the agent still occasionally runs rm -rf because that is what the next-token distribution suggested.

Prompts are probabilistic. Hooks are deterministic. The difference is not cosmetic. When you write a PreToolUse hook that pattern-matches rm -rf / and returns block, the agent cannot execute that command. Not “is less likely to.” Cannot. The hook is code, not persuasion.

This is the same lesson the security industry learned about input validation in the 2000s. You do not ask the user politely not to send SQL injection. You parse and sanitize at the boundary, deterministically, every time. Hooks are input validation for tool calls. The agent is the user. The tool is the database. The hook is the parser.

Pass^K, and Why It Is Stricter Than You Think

Wolfe’s piece reframes the eval question. Most teams have spent the last two years measuring agent quality with Pass@K: did at least one of K attempts succeed? That metric flatters models. On Pass@5, an agent that succeeds 1 time in 5, with 4 catastrophic failures, scores the same as one that succeeds all 5 times. In production, the first agent is unusable. Pass@K cannot see the difference.

Pass^K measures the opposite. Did all K independent attempts succeed? It is the consistency metric, not the capability metric. Pass^K is what you care about when the agent is going to run in a loop, on a customer’s data, without a human watching each attempt. One failure in five is not a 20% problem. It is the only outcome you ever see in the incident postmortem.
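To make the contrast concrete, here is a sketch using the combinatorial estimators commonly used for these two metrics, given c successes observed in n attempts on one task. Whether Wolfe's survey computes them in exactly this form is an assumption; the function names are mine.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k sampled attempts succeeds,
    given c successes observed in n attempts (requires k <= n)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k sampled attempts succeed (Pass^K)."""
    return comb(c, k) / comb(n, k)

# An agent that succeeds once in five attempts versus one that succeeds every time.
flaky = {"n": 5, "c": 1, "k": 5}
reliable = {"n": 5, "c": 5, "k": 5}

print(pass_at_k(**flaky), pass_at_k(**reliable))    # 1.0 1.0
print(pass_hat_k(**flaky), pass_hat_k(**reliable))  # 0.0 1.0
```

Pass@5 gives both agents a perfect score on this task. Pass^5 separates them completely.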

The numbers Wolfe cites land hard. Terminal-Bench 2.0 distilled 89 production-grade tasks from 229 contributions, and GPT-5.2, the strongest model evaluated, hits 62.9% resolution. That is Pass@1, a single attempt per task. Tau^2-bench’s telecom domain is harsher: o4-mini scores 26% Pass^4. Run an o4-mini agent four times on each telecom task and only about one task in four succeeds on all four runs. The other three in four fail at least once, which is the kind of inconsistency a customer notices.
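As a rough way to read a Pass^K figure, assume that every attempt succeeds independently with the same probability p. This is a simplification for intuition, not something either benchmark claims. Then:

```latex
\[
\text{Pass}^K = p^K
\qquad\Longrightarrow\qquad
p = \left(\text{Pass}^K\right)^{1/K}
\]
\[
0.26^{1/4} \approx 0.71,
\qquad
0.90^{4} \approx 0.66
\]
```

Under that assumption, 26% Pass^4 corresponds to a per-attempt success rate near 71%, and even an agent that succeeds 90% of the time per attempt clears all four runs only about two times in three. Real tasks differ in difficulty, so this is intuition, not a conversion formula.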

Pass^K is not a hostile metric. It is the metric your customer is implicitly using. They run your agent on Tuesday and it works. They run it on Wednesday on the same input and it fails. Pass@1 says you have a 50% agent. Pass^2 says you have a 0% agent. Your customer agrees with Pass^2.

The Shell Has Two Walls

Stack Dabit’s six events on the inbound side and Wolfe’s Pass^K on the outbound side and the architecture is symmetrical. Hooks decide what gets in. Evals decide whether the output, run K times, is consistent enough to trust. The probabilistic core sits in the middle, doing what models do, with deterministic walls on both sides.

Side	Primitive	Question it answers
Inbound	Six lifecycle hooks (Dabit)	What is the agent allowed to do?
Outbound	Pass^K (Wolfe)	Does the agent do the same thing every time?

What both sides refuse to do is trust the model alone. The hook author does not believe the agent will avoid .env even with a perfect prompt. The eval author does not believe a single passing run says anything. Both authors have moved the trust boundary out of the model and into the surrounding code.

This is the same shift that happened to web applications when they stopped trusting client-side validation. Server-side validation is the deterministic wall. Hooks and Pass^K evals are the agent-era equivalent. The model is the client. The platform team writes the server.

The 65% Rule, Updated

We have argued before that production agentic systems settle at roughly 65% AI code and 35% deterministic scaffolding. Hooks and Pass^K evals are how the 35% gets specified. The 35% is not “extra plumbing.” It is the part of the system whose reliability the customer is paying for. The 65% is the part that does the work. The 35% is the part that ensures the work was done correctly, every time, without leaking secrets, without touching files it should not, and without diverging across runs.

Teams that try to ship at 95% agent code and 5% scaffolding are not shipping a better agent. They are shipping an agent without the deterministic shell, and the customer will discover this on the day the agent does something that the prompt was supposed to prevent. Pass^K will say 12%. The incident review will say “we needed hooks.”

What to Do This Week

Pick one production agent. Just one. Walk it through three diagnostics:

Hook inventory. Write down every PreToolUse and PostToolUse hook you actually have in the system. If the list is empty, your agent is operating without an inbound wall. Pick the two most dangerous tool calls (file writes, shell, SQL) and write a PreToolUse hook that blocks the obvious destructive patterns. Block rm -rf /, DROP TABLE, edits to .env, edits to .git. That is one afternoon of work and it removes a class of incident.

Stop-gate state. Decide what “done” means for this agent, write it as a JSON state, and write a Stop hook that refuses to declare completion until every required field is satisfied. If the agent says “task complete” without running the test suite, the Stop hook should reject the completion claim and force another iteration.
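A Stop hook of this kind can be very small. In the sketch below, the gate names (`tests_passed`, `lint_passed`, `diff_reviewed`), the state file layout, and the return shape are hypothetical. The only detail taken from the text is the idea of a persisted JSON state file that must show every gate satisfied.

```python
import json
from pathlib import Path

# Hypothetical gate names; define whatever "done" means for your agent.
REQUIRED_GATES = ("tests_passed", "lint_passed", "diff_reviewed")

def stop_hook(state_path: Path) -> dict:
    """Allow completion only if every required gate is recorded as true."""
    try:
        state = json.loads(state_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # missing or corrupt state counts as no gates passed
    if not isinstance(state, dict):
        state = {}
    missing = [gate for gate in REQUIRED_GATES if state.get(gate) is not True]
    if missing:
        return {
            "decision": "block",
            "reason": "gates not satisfied: " + ", ".join(missing),
        }
    return {"decision": "allow", "reason": ""}
```

Failing closed is the deliberate choice here: if the state file is absent or unreadable, the hook blocks instead of assuming the gates ran.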

Pass^K measurement. Take the ten tasks your agent runs most often in production. Run each one four times. Count how many succeed on all four runs and divide by ten. That is your Pass^4. If the number is below 50%, your customers are seeing non-determinism that will eventually become an incident. Tighten the prompts, tighten the hooks, or constrain the tool surface until Pass^4 comes up.
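The measurement itself is a short loop. In this sketch each task is a hypothetical zero-argument callable that runs the agent once and returns True on success. The stand-in tasks in the usage example are placeholders, not real workloads.

```python
from itertools import cycle
from typing import Callable

def pass_hat_k_empirical(tasks: dict[str, Callable[[], bool]], k: int = 4) -> float:
    """Fraction of tasks that succeed on all k independent runs."""
    if not tasks:
        raise ValueError("need at least one task")
    consistent = 0
    for name, run_once in tasks.items():
        # Run all k attempts (no early exit) so every run can be logged.
        results = [run_once() for _ in range(k)]
        if all(results):
            consistent += 1
    return consistent / len(tasks)

# Usage with deterministic stand-ins: one task always passes,
# one fails on its third run.
flaky_results = cycle([True, True, False, True])
tasks = {
    "refund_lookup": lambda: True,
    "plan_change": lambda: next(flaky_results),
}
print(pass_hat_k_empirical(tasks))  # 0.5
```

In a real harness each callable would start a fresh agent session so that attempts do not share state. Otherwise the runs are not independent and the number overstates consistency.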

Hooks and evals are not the glamorous part of building agents. They are the part that decides whether the agent is something a serious company can put in front of a customer. Dabit gave us the six events. Wolfe gave us the metric. The deterministic shell is now a buildable specification, not a research direction. Build it this week.

This analysis synthesizes Agent Hooks: Deterministic Control for Agent Workflows (Nader’s Thoughts, May 2026) and Agent Evaluation: A Detailed Guide (Cameron R. Wolfe, May 2026).

Victorino Group helps engineering leaders build deterministic shells around probabilistic agents.