Three Harness Architecture Decisions Buyers Are Making Without Knowing It
Last week, the harness became a vendor SKU. This week, it became an architecture decision.
Two posts landed seven days apart, and they do not agree. SWE Quiz published an inside look at how OpenAI’s Codex team grew from 3 to 7 engineers in five months and shipped roughly 1,500 PRs and a million lines of generated code. Mendral published the opposite-side decision: their argument for putting the agent harness outside the sandbox, with a 25ms-resume runtime from Blaxel, durable execution via Inngest, and shared org state path-virtualized to Postgres.
We have spent two months arguing that the harness is the product and that vocabulary is now cross-discipline. Buyer-side audit was the natural next step. What this week added is sharper: there are three architectural decisions hiding inside every harness purchase, and most procurement teams are signing them blindly.
Decision One: Where Does the Loop Run
The Codex team and Mendral are not picking different vendors. They are picking different topologies.
The Mendral post is unusually direct about it. The agentic loop, they argue, “runs on your backend. When it needs to execute a tool, it calls into a sandbox over an API.” The sandbox is a leaf node. The orchestration brain (token usage, turn count, retries, durable checkpointing, prompt construction, memory writes) sits on the server you operate. Their stack is concrete: Blaxel for sandboxes that resume from standby in 25ms (“low enough that the agent can’t tell the sandbox was ever gone”), Inngest for durable execution with checkpointing per turn, and Postgres for shared state virtualized into the sandbox under /skills/* and /memory/*.
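The checkpoint-per-turn shape is worth seeing in code. Here is a minimal, vendor-neutral sketch; every name in it (`Checkpoint`, `runAgent`, `callSandboxTool`, the in-memory store) is illustrative, not the Blaxel or Inngest API:

```typescript
// Sketch: the orchestration loop lives on your backend and checkpoints
// after every turn, so a sandbox crash never loses progress.

type Checkpoint = { turn: number; messages: string[] };

// Stand-in for a durable store (Inngest step state, Postgres, etc.).
const store = new Map<string, Checkpoint>();

function saveCheckpoint(runId: string, cp: Checkpoint): void {
  store.set(runId, cp);
}

function loadCheckpoint(runId: string): Checkpoint {
  return store.get(runId) ?? { turn: 0, messages: [] };
}

// Stand-in for the sandbox leaf node: the loop calls it over an API and
// treats the result as data, never as the system of record.
function callSandboxTool(input: string): string {
  return `tool-result(${input})`;
}

// One resumable run: if the process dies mid-run, restarting with the
// same runId resumes from the last persisted turn instead of turn zero.
function runAgent(runId: string, maxTurns: number): Checkpoint {
  let cp = loadCheckpoint(runId);
  while (cp.turn < maxTurns) {
    const result = callSandboxTool(`turn-${cp.turn}`);
    cp = { turn: cp.turn + 1, messages: [...cp.messages, result] };
    saveCheckpoint(runId, cp); // durable write before the next turn
  }
  return cp;
}
```

The design choice to notice: the sandbox appears only inside `callSandboxTool`. Everything that makes the run recoverable lives on the caller's side of that boundary.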
Compare that to a managed-agent product where the loop, the sandbox, the filesystem, and the event log all live inside the vendor. Same surface. Different boundary. In Mendral’s design, your backend is the system of record. In the managed design, the vendor is.
Neither is wrong. They optimize for different failure modes. The Mendral topology fails gracefully when a sandbox crashes: the loop survives, and the next turn resumes from a checkpoint. The managed topology fails gracefully when you crash: the vendor keeps the agent running while your platform team is asleep.
The question buyers do not realize they are answering is: which crash do you trust the other party to absorb?
Decision Two: How AGENTS.md Scales
The OpenAI piece reveals something operationally interesting that has nothing to do with hosting. It is a documentation pattern.
The Codex team’s AGENTS.md is not the wall of instructions most teams write. It is a table of contents pointing at four directories: docs/design-docs/, docs/product-specs/, docs/exec-plans/, and docs/references/. The agent reads the index, then fetches only the relevant document. Progressive disclosure replaces the instruction blob.
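The shape of such a file is simple enough to sketch. The four directory names come from the article; the one-line descriptions are illustrative:

```markdown
# AGENTS.md — an index, not an instruction blob

Read only what the task needs:

- docs/design-docs/    — architecture decisions and their rationale
- docs/product-specs/  — what each feature must do
- docs/exec-plans/     — in-flight implementation plans
- docs/references/     — API and schema references
```

The agent pays the token cost of one short index on every turn, and the cost of a full document only when the task demands it.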
Why this matters at the architecture level: every harness implies a context strategy, and the context strategy compounds at team scale. A 3-person team can hand-curate a 5,000-token system prompt. A 7-person team shipping 1,500 PRs cannot. Either you build a context strategy that scales with the codebase, or the codebase silently outgrows the agent’s working memory and quality drops without anyone noticing.
The Codex team also runs per-worktree isolation: the agent spins up an isolated product instance per branch, then validates visually through the Chrome DevTools Protocol. Engineers prompt “ensure service startup completes in under 800ms” and Codex validates directly. That is not a model capability. It is a harness capability: specifically, a sandbox-and-tooling capability that lives wholly inside the team’s own infrastructure.
The architectural decision buyers are making here is whether the harness they purchase exposes the seams to plug in their own conventions. AGENTS.md as table of contents only works if the harness lets you control the working directory, the file resolution rules, and the tool layer. Some managed harnesses do. Some do not. Procurement rarely asks.
Decision Three: Whose State Does the Agent Read
This is the decision that hides in plain sight and will hurt the most teams.
A single-user agent reads its own filesystem and its own memory. A multi-user agent inside a real organization reads shared state: the team’s skills, the org’s memory, the artifacts another agent wrote thirty minutes ago. Mendral’s architecture handles this explicitly: /skills/* and /memory/* are virtualized paths backed by Postgres, scoped per-user, per-team, or per-org, so when an agent writes a skill, every other agent in the right scope reads it next turn.
That is a database decision dressed as a filesystem decision. It implies row-level security, scope semantics, conflict resolution when two agents write to the same path, and retention policies for memory, none of which appears on a harness vendor’s overview page, because none of it sells.
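One way to picture the scoping question is a resolution function. This is a hypothetical sketch; the scope precedence (user shadows team shadows org) and the storage model are assumptions for illustration, not Mendral's documented design:

```typescript
// Sketch: reads under /skills/* and /memory/* resolve to scoped rows in
// a shared store instead of the sandbox's real filesystem.

type Scope = { userId: string; teamId: string; orgId: string };
type Row = { path: string; scopeKey: string; content: string };

const rows: Row[] = []; // stand-in for a Postgres table

function write(path: string, scopeKey: string, content: string): void {
  rows.push({ path, scopeKey, content });
}

// Resolve a virtual path: the most specific scope wins.
function read(path: string, scope: Scope): string | undefined {
  const precedence = [
    `user:${scope.userId}`,
    `team:${scope.teamId}`,
    `org:${scope.orgId}`,
  ];
  for (const key of precedence) {
    const hit = rows.find((r) => r.path === path && r.scopeKey === key);
    if (hit) return hit.content;
  }
  return undefined; // path not virtualized for this scope
}
```

Even this toy version forces the questions the overview page skips: who wins when scopes collide, who enforces the scope key, and what happens to rows nobody reads anymore.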
OpenAI’s Codex team hits the same wall from a different angle. When the team grew, “Friday AI slop cleanup” became a repeating event; drift accumulated faster than humans could review it. Their fix was structural: a background agent that scans for accumulated mess and proposes cleanup PRs. That only works when the agent can read shared codebase state, including the work other agents wrote during the week. It is the same shared-state problem in a different costume.
For buyers, the architectural decision is: does this harness model multi-user shared state as a first-class primitive, or as something we will bolt on later? The honest answer for most managed products today is “later.” The honest answer for production AI inside a real org is “we needed it on day one.”
Why the Decision Compounds
Each of these three decisions is recoverable on its own. Together, they are not.
A harness that puts the loop in the vendor cloud, ships AGENTS.md as an instruction blob, and treats state as single-user is not three small choices. It is an architecture that resists every refactor you will want in eighteen months. Migrating the loop boundary means rewriting orchestration. Migrating the context strategy means rewriting every system prompt. Migrating shared state means rewriting how agents read and write the world.
This is the procurement layer of last week’s harness-as-SKU shift. Vendors have made the harness easy to buy. They have not made the architecture obvious to evaluate. The two SKUs on a comparison sheet may sit on opposite sides of all three decisions, and the matrix will not show it.
What it will show, six months in, is friction. The team building real workflows hits the shared-state ceiling. The platform group hits the loop-boundary ceiling when an incident requires inspecting state the vendor owns. The architects hit the AGENTS.md ceiling when the codebase grows past the prompt window. None of these are vendor failures. All of them are architecture mismatches that were inside the contract on day one.
This is the same thing we wrote about in passive context for AI agents: the system that surrounds the model determines outcomes far more than the model does. The Codex team and Mendral both believe that. They picked opposite topologies because they are solving opposite problems. Buyers do not get to skip the choice.
The Procurement Checklist
Before signing the next harness contract, get written answers to three questions:
- Loop topology. Where does the agent loop physically execute, and where does the system of record for turn-by-turn state live? If the answer is “the vendor” for both, your incident-response runbook is a support ticket. If the answer is “us” for both, your platform team owns durable execution, and you should hear them say “Inngest” or “Temporal” without flinching.
- Context strategy at team scale. Can the harness load context progressively from a directory structure under your control, or does it expect a single instruction blob? Three engineers can live with the second. Seven cannot, and the gap is invisible until quality drops.
- Shared state semantics. Are skills and memory scoped per-user, per-team, or per-org, and is that scoping enforced by the platform or by your application code? If the platform answers “per-user only,” every multi-user workflow you build is a security review waiting to fail.
Three questions. Each one decides an architecture. Each architecture compounds. The harness market is moving fast enough that vendors will not surface these decisions on their own; their pages are optimized for activation, not procurement, and that is rational of them.
It is on the buyer to ask. Last week, the harness was a vendor SKU. This week, it is an architecture decision. Next week, it will be a refactor someone is paying for.
This analysis synthesizes Harness Engineering: How OpenAI Ships Without Writing Code (SWE Quiz, May 2026) and The Agent Harness Belongs Outside the Sandbox (Mendral, April 2026).
Victorino Group helps procurement teams audit harness architecture decisions before they become vendor lock-in. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com. About The Thinking Wire →
If this resonates, let's talk
We help companies implement AI without losing control.
Schedule a Conversation