Determinism Is Governance: The Control Layer Is the Code Around the Model

On June 2, 2026, Anthropic shipped dynamic workflows in Claude Code: Claude now writes JavaScript orchestration on the fly, fans out to tens or hundreds of subagents with fresh scoped context, and iterates until the results converge (InfoQ, June 2026). The multi-step plan stopped living in the model’s in-context memory. It moved into inspectable code.

That move is the whole argument. We have written before that the workflow is a governance primitive and that the harness is where agent behavior is shaped. Those pieces made the case in principle. June gave the principle a mechanism and a measurement. The mechanism is method-as-code. The measurement is PwC’s retrieval numbers, which show how much a harness decision moves accuracy on a model you cannot change.

The plan belongs in code, not the context window

A long-running agent task is a multi-step plan. Step three depends on what step two found. Step seven re-checks step four. When that plan lives in the model’s context window, it degrades the way any in-context state degrades: tokens fall out of attention, earlier decisions get paraphrased into something slightly different, and by step nine the agent is reasoning over a blurred copy of its own earlier intent.

Anthropic’s dynamic workflows put the plan in JavaScript. The orchestration logic is written, then executed. A loop that fans out to two hundred subagents is a for loop with a convergence check, not a paragraph of instruction the model has to re-read and re-honor on every turn. Each subagent gets a fresh, scoped context instead of inheriting the accumulated drift of the parent. The convergence gate is a condition in code: keep iterating until the results stop changing, then stop.
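To make the shape of that loop concrete, here is a minimal sketch in Python rather than the JavaScript the workflows themselves use. It is not Anthropic's implementation. The names ScopedTask, run_until_converged, and stub_subagent are hypothetical, the subagent is a stub standing in for a model call, and the fan-out runs sequentially for clarity where a real harness would likely run it concurrently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ScopedTask:
    """Everything one subagent is allowed to see: its own slice, nothing inherited."""
    task_id: int
    context: str


# Hypothetical signature: (task, round number) -> result text.
Subagent = Callable[[ScopedTask, int], str]


def run_until_converged(
    tasks: List[ScopedTask],
    run_subagent: Subagent,
    max_rounds: int = 10,
) -> Tuple[Dict[int, str], int]:
    """Fan out once per round; stop when a round reproduces the previous one."""
    previous: Optional[Dict[int, str]] = None
    for round_no in range(1, max_rounds + 1):
        # The fan-out is an ordinary loop, run sequentially here for clarity.
        current = {task.task_id: run_subagent(task, round_no) for task in tasks}
        if current == previous:  # convergence gate: results stopped changing
            return current, round_no
        previous = current
    # A hard cap keeps a non-converging run from iterating forever.
    raise RuntimeError(f"no convergence after {max_rounds} rounds")


def stub_subagent(task: ScopedTask, round_no: int) -> str:
    """Stand-in for a model call: its answer stops changing from round 3 on."""
    return f"{task.context}:draft{min(round_no, 3)}"


tasks = [ScopedTask(i, f"module-{i}") for i in range(3)]
results, rounds = run_until_converged(tasks, stub_subagent)
print(rounds, results[0])  # prints: 4 module-0:draft3
```

The stub stabilizes in round 3, and the gate detects that in round 4, when a round first matches its predecessor. The round count, the per-task context, and the stopping condition are all ordinary values that can be logged or asserted on.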

This is auditable in a way a prompt never is. You can read the loop. You can set a breakpoint. You can prove how many subagents ran, what context each received, and what condition ended the iteration. A prompt that says “keep refining until the answer is good” gives you none of that. The control surface is the code around the model, and code is inspectable by construction.

The deterministic layer has a name now

Builder.io’s Agent Experience post names the layer the rest of the industry has been building without naming. Their seven tenets describe what it takes for agents to work safely inside a real codebase, and two of them are pure determinism.

The first is deterministic safety: sandboxing, scoped credentials, and approval gates that hold regardless of what the model decides to do. The model proposes; the deterministic layer disposes. A scoped credential cannot be talked out of its scope by a clever generation. An approval gate fires on a condition, not on the model’s confidence.
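A gate of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Builder.io's design: ScopedCredential, ProposedAction, NEEDS_APPROVAL, and decide are hypothetical names, and the policy (glob-scoped paths, two actions that always need a human) is invented for the example. The point is that the model's confidence is recorded but never consulted.

```python
import posixpath
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ScopedCredential:
    allowed_actions: FrozenSet[str]
    allowed_paths: Tuple[str, ...]  # glob patterns, e.g. "src/*"


@dataclass(frozen=True)
class ProposedAction:
    action: str
    path: str
    model_confidence: float  # kept for the audit trail, never used in the decision


# Assumed policy for this sketch: these actions always require a human.
NEEDS_APPROVAL = frozenset({"delete", "deploy"})


def decide(
    cred: ScopedCredential,
    proposal: ProposedAction,
    approved_by_human: bool = False,
) -> str:
    """Return 'allow', 'deny', or 'needs_approval' from fixed rules only."""
    if proposal.action not in cred.allowed_actions:
        return "deny"
    # Normalize first so "src/../secrets" cannot slip through a "src/*" scope.
    path = posixpath.normpath(proposal.path)
    if not any(fnmatchcase(path, pattern) for pattern in cred.allowed_paths):
        return "deny"
    if proposal.action in NEEDS_APPROVAL and not approved_by_human:
        return "needs_approval"
    return "allow"


cred = ScopedCredential(frozenset({"read", "write", "delete"}), ("src/*",))
print(decide(cred, ProposedAction("write", "secrets/key.pem", 0.99)))  # prints: deny
print(decide(cred, ProposedAction("delete", "src/app.py", 0.99)))  # prints: needs_approval
```

A confidence of 0.99 changes nothing in either call, because the decision is a function of the credential and the proposal alone.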

The second is verification before handoff. An agent does not pass its output downstream on the strength of having produced it. The output runs through a check, written in code, before anything depends on it. Builder.io’s framing of the economics is exact: “Spend tokens before spending reviewer attention.” Let the agent burn compute verifying its own work against a deterministic test, so the human reviewer sees only what survived the test. Reviewer attention is the scarce resource. The deterministic check is how you protect it.
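One way to picture verification before handoff is a bounded retry loop around a deterministic check. The sketch below is an assumption-laden illustration, not Builder.io's implementation: HandoffResult and verify_before_handoff are hypothetical names, generate stands in for an agent call, and check stands in for whatever test suite or validator a team actually trusts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HandoffResult:
    output: Optional[str]  # None when nothing passed the check
    attempts: int
    passed: bool


def verify_before_handoff(
    generate: Callable[[int], str],  # stand-in for the agent; receives the attempt number
    check: Callable[[str], bool],    # deterministic test written in code
    max_attempts: int = 3,
) -> HandoffResult:
    """Spend agent attempts first; release output only if the check passes."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if check(candidate):
            return HandoffResult(candidate, attempt, True)
    # Nothing survived the check, so nothing reaches the reviewer as "done".
    return HandoffResult(None, max_attempts, False)


# Toy agent and toy check: output is acceptable once it has at least two characters.
result = verify_before_handoff(lambda n: "x" * n, lambda s: len(s) >= 2)
print(result)  # prints: HandoffResult(output='xx', attempts=2, passed=True)
```

The reviewer sees only a result with passed set to True. A failed run returns no output at all, which makes a missing check impossible to mistake for a passing one.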

Neither tenet trusts the model to govern itself. Both put the guarantee in the surrounding code. That is the same shape as Anthropic’s orchestration loop, arrived at from the opposite direction: one from how agents plan, the other from how agents ship.

Even retrieval is a harness decision

The strongest evidence that control lives in the harness comes from PwC, in research published as Is Grep All You Need? (June 2026). The team ran LongMemEval, a 116-question benchmark for long-context retrieval, and varied only the harness around a fixed model.

The headline result: grep beat vector search. Agentic grep scored 83.6 to 93.1 percent. Vector retrieval landed at 62.9 to 83.6 percent. An agent that searches a corpus with literal text patterns, the way an engineer greps a codebase, outperformed the embedding-and-similarity machinery that retrieval-augmented generation treats as the default. The model was identical across both. The retrieval method was the variable.
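For readers who have not seen grep-style retrieval written down, the core operation is small. The sketch below is not PwC's harness. The grep function, the corpus layout (document id mapped to text), and the sample sessions are all hypothetical. It shows only the literal-pattern search an agent would call as a tool, with the agent choosing which patterns to try.

```python
import re
from typing import Dict, List, Tuple


def grep(
    corpus: Dict[str, str],
    pattern: str,
    ignore_case: bool = True,
) -> List[Tuple[str, int, str]]:
    """Return (doc_id, line_number, line) for every line containing the literal pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the search literal: "Dr." matches a real dot, not any character.
    needle = re.compile(re.escape(pattern), flags)
    hits = []
    for doc_id, text in corpus.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if needle.search(line):
                hits.append((doc_id, line_no, line))
    return hits


# Invented sample data for illustration only.
corpus = {
    "session-01": "User: my dentist is Dr. Alvarez.\nAssistant: Noted.",
    "session-02": "User: book the dentist for Friday.\nAssistant: Done.",
}
for hit in grep(corpus, "dentist"):
    print(hit)
```

Every hit carries a document id and a line number, so the evidence the model saw can be reproduced exactly. A similarity score from an embedding index does not offer that by default.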

Two more numbers make the point sharper. Harness design alone swung accuracy from 76.7 to 93.1 percent, a 16-point spread on the same model with no change to weights or prompt. And delivery mode mattered even more: handing the agent results inline scored 93.1 percent, while writing the same results to a file the agent then had to open dropped it to 55.2 percent. Same model, same data, same question set. The only difference was a deterministic choice about how the harness presented information to the model.

A 38-point collapse from a file-versus-inline decision is not a tuning detail. On this benchmark, it is strong evidence that the substrate around the model can carry more of the outcome than a model upgrade would. You can pay for a better model and hope to recover a few points. You can fix the harness and recover thirty-eight. The leverage is in the layer you write, not the layer you license.
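The spreads quoted above are simple differences, and they can be checked in a few lines. The percentages come from the figures cited in this piece. The implied question counts are an inference, assuming each percentage is a share of all 116 questions. They are not numbers reported by PwC.

```python
TOTAL_QUESTIONS = 116  # benchmark size as cited above

best_harness, worst_harness = 93.1, 76.7  # percent, same model
inline, via_file = 93.1, 55.2             # percent, same model and data

harness_spread = round(best_harness - worst_harness, 1)
delivery_drop = round(inline - via_file, 1)


def implied_correct(percent: float) -> int:
    """Inferred count of correct answers, assuming all 116 questions were scored."""
    return round(percent / 100 * TOTAL_QUESTIONS)


print(harness_spread)  # prints: 16.4
print(delivery_drop)   # prints: 37.9
print(implied_correct(inline) - implied_correct(via_file))  # prints: 44
```

Under that assumption, the inline-versus-file choice is the difference between roughly 108 and 64 correct answers out of 116, or about 44 questions decided by how results were delivered.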

What this means for governance

Auditing a prompt tells you what you asked for. It does not tell you what happened. The prompt is an intention; the harness is the execution. When the orchestration is method-as-code, the safety is deterministic gates, and the retrieval is an inspectable choice, the agent’s behavior becomes something you can trace, test, and reproduce. That is the definition of governable.

This reframes where governance work goes. Most teams spend their governance budget on prompt review and model selection, the two surfaces that determine the least. The surfaces that determine the most (orchestration logic, credential scoping, verification gates, and retrieval method) often have no owner at all. PwC’s 38-point swing came from a layer most teams do not even instrument.

Do this now

Pick one production agent and locate its control surface. Ask three questions. Where does its multi-step plan live: in the context window, or in code you can read? When it retrieves, does it grep or embed, and has anyone measured the difference on your corpus? When it hands work downstream, what deterministic check runs first, and who owns that check?

If the honest answers are “the context window,” “embeddings, untested,” and “no check,” your agent has no governable control layer. It has a prompt and a hope. Build the deterministic layer next: a written orchestration loop with a convergence condition, a measured retrieval method, and a verification gate that runs before any human looks. Spend the tokens before you spend the attention.

The model is the part you cannot change. The code around it is the part you govern. June made that concrete: the plan is now JavaScript, the safety is now a gate, and how the harness retrieves and delivers information is now a measured decision worth up to 38 points. The control layer was never the prompt.

This analysis synthesizes Dynamic workflows in Claude Code (InfoQ, June 2026), Agent Experience is the new Developer Experience (Builder.io, June 2026), and Is Grep All You Need? (PwC, June 2026).
