The Model Lab Contains Its Own Agent at the Environment Layer First

Thiago Victorino

For months we have argued a single ordering principle: secure agents at the environment layer first, steer behavior at the model layer second. Sandboxes, VMs, and egress controls catch what prompts and classifiers miss, because the deterministic boundary holds when the probabilistic one fails. The objection was always the same. That is the view from outside, from people who do not build the model. Maybe the lab that does build it trusts the model more.

It does not. Anthropic published How We Contain Claude in June 2026, an engineering account of how it isolates its own agents. The ordering it states is the one we have been writing. By its own account: “Design for containment at the environment layer first, then steer behavior at the model layer.” And the line that could have been our subhead: “the deterministic boundary is what gets hit when everything probabilistic misses.”

This is not us being proven right by a press release. It is the first time we have had first-party data from the lab that ships the most capable model, and the data is more specific than anything the outside-in case could supply.

The Reversal, In Their Own Words

We covered the outside-in version of this in Agent Containment Became Table Stakes, where four uncoordinated vendors shipped containment layers in a single week. Procurement validated the pattern before any vendor explained it. What was still missing was the builder’s perspective. If anyone had grounds to trust the model to police itself, it would be the team with the deepest access to the model’s weights, evaluations, and training signal.

Anthropic reports the opposite stance. It names three risk types it designs against: user misuse, model misbehavior, and external attackers running prompt injection or supply-chain attacks. None of the three is fully solvable inside the model. Misuse routes around alignment. Misbehavior is the model itself. Injection is adversarial input crafted to defeat the classifier. So the lab places the load-bearing defense outside the model and treats the model layer as the second line, not the first.

The honest framing here matters. This is a vendor describing its own products, so the narrative favors Anthropic by construction. What gives it weight is not the marketing. It is the disclosed failures, which we read below. Every breach Anthropic admits bypassed the model layer while the environment boundary held or had to be hardened. That is the shape our thesis predicts.

Isolation Scaled to Oversight Capacity

The most useful operating principle in the piece is not the ordering. It is the tiering. Anthropic does not apply one containment design to every product. It scales isolation strength to how much human oversight each surface can realistically carry.

Three products, three different boundaries:

  • claude.ai runs in gVisor containers with seccomp on isolated server infrastructure, each session getting an ephemeral filesystem. A consumer chat surface has near-zero per-action human review, so the boundary is a hardened container that resets between sessions.
  • Claude Code uses OS-level sandboxing, Seatbelt on macOS and bubblewrap on Linux, read-only by default outside the workspace, with permission prompts for write, bash, and network. A developer is in the loop, so the design leans on a sandbox plus interactive approval.
  • Claude Cowork runs a full VM through a platform hypervisor, Apple Virtualization on macOS and HCS on Windows. Credentials stay in the host keychain, network isolation is enforced over vsock, and file mounts come in read-only, read-write, or read-write-no-delete modes. A long-running autonomous worker gets the heaviest isolation because the oversight per action approaches zero.

Container, sandbox, VM. The progression tracks oversight capacity, not product prestige. This is a principle our readers can apply directly. Tier your own agent deployments by how much human review each one actually gets, then match isolation strength to that number. An agent nobody watches gets a VM. An agent a developer babysits can live in a sandbox. The mistake is giving the unwatched agent the watched agent’s containment.
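
The tiering rule can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's method: the `Agent` fields, the 50% and 5% thresholds, and the ranking of container between sandbox and VM are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Isolation(IntEnum):
    # Ordering is an assumption of this sketch: higher value = heavier boundary.
    SANDBOX = 1
    CONTAINER = 2
    VM = 3


@dataclass
class Agent:
    name: str
    actions_per_week: int
    reviewed_per_week: int  # actions a human actually looked at
    current: Isolation      # containment the agent has today


def review_rate(agent: Agent) -> float:
    """Fraction of the agent's actions that a human reviews."""
    if agent.actions_per_week <= 0:
        return 0.0
    return agent.reviewed_per_week / agent.actions_per_week


def required_isolation(agent: Agent, watched: float = 0.5,
                       unwatched: float = 0.05) -> Isolation:
    """Map oversight capacity to a minimum tier (thresholds are hypothetical)."""
    rate = review_rate(agent)
    if rate >= watched:
        return Isolation.SANDBOX
    if rate > unwatched:
        return Isolation.CONTAINER
    return Isolation.VM


def under_contained(agent: Agent) -> bool:
    """True when the agent has lighter containment than its oversight warrants."""
    return agent.current < required_isolation(agent)
```

Running `under_contained` across a fleet flags exactly the mistake named above: an agent with almost no review that is still sitting in the containment designed for a watched one.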

The Numbers We Never Had

The outside-in case could argue the principle. It could not measure it. Anthropic’s account supplies the internal numbers, and they quantify exactly why the model layer cannot carry the load alone.

On the oversight side, Anthropic reports that users approved roughly 93% of Claude Code permission prompts. Approval rates rose with experience while oversight quality fell. That is approval fatigue, measured. A permission prompt that gets a yes 93% of the time is not a control. It is a speed bump that users learn to roll over. Anthropic’s own mitigation is telling: adding the sandbox produced an 84% reduction in permission prompts. The fix for fatigue was not a better prompt. It was fewer prompts, made possible by a deterministic boundary that no longer needed to ask.

On the model side, the numbers are better than I expected and still not enough. Claude Code’s “auto mode” catches about 83% of overeager behaviors by Anthropic’s count. The other 17% slip through. Claude Opus 4.7 holds prompt-injection attack success to roughly 0.1% on single attempts, which is strong, but that rate rises to 5 to 6% after 100 adaptive attempts on the Gray Swan Agent Red Teaming benchmark Anthropic cites. An attacker that runs continuously, the kind we described in Three Autonomy Failures, Three Blast Radii, does not stop at one attempt. It runs all hundred. A 0.1% rate becomes a 5 to 6% rate the moment the adversary is patient, and patience is free for an agent.
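
To see why repeated attempts compound, here is a small worked calculation. It assumes each attempt is an independent trial with the same success probability, which is a simplification of this sketch and not how the benchmark figures were produced; adaptive attempts are correlated, and "roughly 0.1%" is a rounded number.

```python
def cumulative_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


def implied_single_rate(p_cumulative: float, attempts: int) -> float:
    """Per-attempt rate that would yield `p_cumulative` under independence."""
    return 1.0 - (1.0 - p_cumulative) ** (1.0 / attempts)


if __name__ == "__main__":
    # Illustrative inputs taken from the figures quoted above.
    for n in (1, 10, 100):
        print(n, cumulative_success(0.001, n))
    print(implied_single_rate(0.055, 100))
```

Under the independence assumption, a true 0.1% per-attempt rate would compound to roughly 9.5% over 100 attempts, so the reported 5 to 6% is of the same order as simple compounding would suggest. The measured benchmark numbers remain the authoritative ones. The arithmetic only shows that a small single-shot rate cannot stay small against a persistent attacker.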

So the model layer is good and porous at the same time. Good enough to be worth having. Porous enough that the boundary behind it has to be real.

The Failures All Breached the Model Layer

The strongest evidence in the piece is what Anthropic admits went wrong. Read as a set, the disclosed breaches are a clean demonstration that the model layer is the one that gets bypassed.

A direct prompt-injection phishing campaign exfiltrated AWS credentials in 24 of 25 attempts. By Anthropic’s account, the filesystem and network boundary was the defense that mattered once the model-layer classifiers had failed. Twenty-four out of twenty-five is not a porous model. It is a defeated one, on that specific attack, with the environment layer left holding the line.

Cowork suffered an approved-domain exfiltration. An API allowlist meant to constrain where the agent could send data let an attacker’s own API keys upload files to the attacker’s account, because the destination domain was on the list even though the credentials were hostile. Anthropic remediated with a man-in-the-middle proxy inside the VM that validates session tokens. Note where the fix lives: inside the environment, not in the model’s judgment about whether the request looked suspicious.
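
The gap between the two checks can be sketched as follows. This is a hypothetical illustration of the idea, not Anthropic's proxy: the host name, the token values, and the `OutboundRequest` shape are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the real product's list is not given in the source.
ALLOWED_HOSTS = {"api.example.com"}


@dataclass(frozen=True)
class OutboundRequest:
    host: str
    session_token: str  # credential the request carries


def allow_by_domain_only(req: OutboundRequest) -> bool:
    """The weak check: any credential passes if the destination is listed."""
    return req.host in ALLOWED_HOSTS


def allow_by_domain_and_token(req: OutboundRequest, issued_tokens: set) -> bool:
    """The stronger check: the destination must be listed AND the credential
    must be one this session was actually issued."""
    return req.host in ALLOWED_HOSTS and req.session_token in issued_tokens
```

With `issued_tokens = {"tok-session-123"}`, a request to `api.example.com` carrying an attacker-supplied token passes the first function and fails the second. The decision is made by code at the boundary and never depends on whether the model found the request suspicious.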

Then the subtle one. Code execution before the trust dialog. Hooks in .claude/settings.json were parsed at startup, which meant opening a malicious project could run code before the user ever saw a trust prompt. The fix principle Anthropic states is durable beyond this bug: treat project-open, config-load, and localhost listeners as untrusted input. The attack did not go through the model at all. It went through the harness, before the model was even in the loop.
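
One way to apply that principle is to parse project configuration as inert data and withhold anything executable until trust has been granted. The sketch below is a generic illustration: the `"hooks"` key and its list-of-strings shape are placeholders, not the real settings schema, and the function names are hypothetical.

```python
import json


def load_project_settings(raw_json: str, trusted: bool) -> dict:
    """Parse settings as data only; drop executable entries until trusted."""
    settings = json.loads(raw_json)  # parsing never runs anything
    if not isinstance(settings, dict):
        raise ValueError("settings must be a JSON object")
    if not trusted:
        # Untrusted project: keep passive options, discard anything runnable.
        settings = {k: v for k, v in settings.items() if k != "hooks"}
    return settings


def runnable_hooks(settings: dict) -> list:
    """Hooks the harness may execute; empty for an untrusted project."""
    return list(settings.get("hooks", []))
```

The design choice is ordering: the trust decision is an input to the loader, so there is no code path on which a hook becomes runnable before the user has answered the prompt.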

Three admitted failures. None of them were stopped by the model deciding to behave. Each was stopped, or had to be re-contained, at the deterministic boundary. We made the boundary-as-unit-of-analysis case in The Network Is the New Identity Boundary; Anthropic’s incident list is the first-party confirmation that the boundary is where the fight actually happens.

Do This Now

Take your highest-autonomy agent and answer one question Anthropic’s data forces: what is your real per-action approval rate? If it is anywhere near 93%, your permission prompts are not a control and you should stop counting them as one. Replace the fatigued prompts with a deterministic boundary that does not need to ask, the way Anthropic cut prompts 84% by adding a sandbox.
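
Measuring that rate takes very little code once prompt decisions are logged. The sketch below assumes a log that records each decision as the string `"approved"` or `"denied"`; that format and the 0.90 fatigue threshold are assumptions for illustration, not figures from Anthropic.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical cut-off above which prompts are treated as rubber-stamped.
FATIGUE_THRESHOLD = 0.90


def approval_rate(decisions: Iterable[str]) -> Optional[float]:
    """Share of permission prompts answered 'approved'; None if no prompts."""
    counts = Counter(decisions)
    total = counts["approved"] + counts["denied"]
    if total == 0:
        return None
    return counts["approved"] / total


def looks_fatigued(decisions: Iterable[str]) -> bool:
    """True when the approval rate meets or exceeds the fatigue threshold."""
    rate = approval_rate(decisions)
    return rate is not None and rate >= FATIGUE_THRESHOLD
```

A log of 93 approvals and 7 denials yields a rate of 0.93 and is flagged. The number to watch is the trend as well as the level: a rate that climbs as users gain experience is the fatigue pattern described above.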

Then tier your fleet by oversight capacity, not by importance. For each agent, write down how many of its actions a human actually reviews in a normal week. The agents near zero get the heaviest isolation, a sealed VM or its equivalent, regardless of how much you trust the underlying model. The lab with the deepest possible trust in its own model still gives its least-watched product the heaviest boundary. Match that discipline, because the model layer you are relying on holds attack success to 0.1% on the first try and 5 to 6% by the hundredth, and your adversary is the one running the hundredth.


This analysis synthesizes How We Contain Claude (Anthropic, Jun 2026).

Victorino Group helps CTOs and risk officers tier agent deployments by oversight capacity and put deterministic boundaries where the model layer ends.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
