The context window is a budget, not a bucket

A model rated for 1M tokens starts losing the thread well before it gets there. The number on the box is the physical capacity of the container. The number that governs your output is the zone where attention still holds, and that zone sits closer to 100k tokens regardless of whether the box says 200k, 1M, or 2M.

Teams treat the advertised window as a bucket: pour in everything that might be relevant, trust the model to find what matters. The reliability data says the bucket has a false bottom.

The advertised number is not the working number

Garrit Franke’s review of the evidence (garrit.xyz, May 2026) lands on a blunt summary: do not trust large context windows. Two threads support it.

The first is RULER (arXiv 2404.06654), a benchmark that stretches retrieval and reasoning tasks across long inputs. Models that ace the simple “needle in a haystack” test degrade fast once the task requires holding several facts and reasoning over them at distance. The advertised length and the usable length are different measurements.

The second is Chroma’s work on what they call context rot. As the window fills, output quality drops in a non-linear way. Performance does not hold steady until a cliff at the limit. It erodes throughout, and it erodes faster when the context carries distractors, near-duplicates, or loosely related material. The practical “smart zone” cutoff sits around 100k tokens in Franke’s reading, and that figure barely moves when the advertised ceiling jumps by an order of magnitude.

Nielsen Norman Group reached the same place from the human-factors side. Paz Perez (NN/g, June 2026) puts it plainly: “More context does not lead to better results, because every element competes for the model’s attention.” Attention is the scarce resource. Tokens are just how you spend it. A window you fill to the brim is a budget you blew on noise.

Two architectures that respect the budget

If the working window is fixed near 100k while tasks keep growing, the answer is not a bigger container. It is an architecture that keeps each unit of work inside the budget. Calvin French-Owen (calv.info, June 2026) frames the two dominant patterns as the oracle and the firm. Treat the token figures below as approximate; they come from an automated summary of his post rather than the post itself.

The oracle is compaction. OpenAI’s Codex keeps one coherent thread of roughly 200k tokens and, as it fills, summarizes earlier work back down so the thread stays inside the working zone. One mind, one continuous memory, periodically compressed. The strength is coherence: every decision sees the same history. The risk is the compression step. Summarize badly and you discard the fact you needed three turns later, with no record that it ever existed.
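The compaction loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Codex's actual mechanism: `CompactingThread`, `estimate_tokens`, and the `summarize` callable are hypothetical names, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and a real summarizer would be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class CompactingThread:
    budget: int                              # working budget in estimated tokens
    summarize: Callable[[List[str]], str]    # hypothetical summarizer, e.g. a model call
    keep_recent: int = 2                     # most recent turns are never compacted
    turns: List[str] = field(default_factory=list)

    def size(self) -> int:
        """Estimated tokens currently held in the thread."""
        return sum(estimate_tokens(turn) for turn in self.turns)

    def append(self, turn: str) -> None:
        """Add a turn, then fold older turns into one summary if over budget."""
        self.turns.append(turn)
        if self.size() > self.budget and len(self.turns) > self.keep_recent:
            cut = len(self.turns) - self.keep_recent
            old, recent = self.turns[:cut], self.turns[cut:]
            # The original text of `old` is discarded here. Whatever the
            # summarizer drops is gone from the thread for good.
            self.turns = [self.summarize(old)] + recent
```

Note that one pass is not guaranteed to bring the thread back under budget, and that the discarded turns survive nowhere unless something else records them.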

The firm is sub-agents. Anthropic’s pattern fans work out to specialized agents, each with its own context near the larger advertised ceiling, each returning a bounded result to a coordinator. The strength is parallelism and isolation: a research subtask that would bloat the main thread runs in its own context and hands back a summary. The cost is coordination. The coordinator never sees the full reasoning of each agent, only what each chose to report, so a wrong assumption inside one agent can pass upstream looking clean.
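A rough sketch of the fan-out shape, assuming nothing about Anthropic's real implementation: `Report`, `run_subagent`, and `coordinate` are hypothetical names, the `work` callable stands in for a full agent run, and plain truncation stands in for the agent's own summarization step.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Report:
    task: str
    summary: str       # the only text the coordinator ever sees
    tokens_used: int   # estimated size of the agent's private context


def run_subagent(task: str, work: Callable[[str], str], report_limit: int) -> Report:
    """Run one task in its own context and return a bounded report."""
    private_context = work(task)                # full reasoning stays in here
    summary = private_context[:report_limit]    # truncation as a stand-in for summarizing
    return Report(task, summary, len(private_context) // 4)


def coordinate(tasks: List[str], work: Callable[[str], str], report_limit: int = 200) -> List[Report]:
    """Fan tasks out in parallel; results come back in task order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda t: run_subagent(t, work, report_limit), tasks))
```

The coordinator's context grows by at most `report_limit` characters per task, however much each agent consumed privately. That bound is also the blind spot: nothing in `Report` lets the coordinator check an agent's assumptions.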

Neither pattern raises the per-step budget. Both keep the active context small and push the rest somewhere else: into a summary, or into a separate agent’s window. This is the same principle behind keeping utilization low and externalizing state, applied at the architecture level rather than the prompt level. We have covered the prompt-level mechanics in Context Engineering for AI Agents.

The part both architectures hide

Compaction and sub-agents share a weakness. They both decide, on your behalf, what the model gets to keep. Compaction throws away detail when it summarizes. Sub-agents throw away detail when they report. In both cases the discarded material is invisible after the fact, which means when an agent makes a strange call, you cannot reconstruct what it actually saw at the moment it decided.

That is the audit problem, and it is where governance enters. A bigger window does not give you an audit trail. Architecture without a record does not either. What you need is a reconstructable log: an append-only record of what entered the context at each step, what got compacted or delegated, and what came back. The agent’s behavior is a function of its context, so the log of that context is the closest thing to an explanation you can hold.
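One possible shape for such a log, offered as a sketch rather than a prescribed design. The event vocabulary (`enter`, `compact`, `delegate`, `return`) and the names `Event`, `ContextLog`, and `seen_at` are assumptions made for illustration. The SHA-256 hash chain is an addition the argument above does not require; it is included only so that edits to earlier entries become detectable.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List


def _digest(step: int, kind: str, content: str, prev_hash: str) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps([step, kind, content, prev_hash])
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Event:
    step: int
    kind: str        # "enter", "compact", "delegate", or "return" (assumed vocabulary)
    content: str
    prev_hash: str
    hash: str


@dataclass
class ContextLog:
    events: List[Event] = field(default_factory=list)

    def append(self, step: int, kind: str, content: str) -> Event:
        """Record one event; entries are only ever added, never rewritten."""
        prev = self.events[-1].hash if self.events else ""
        event = Event(step, kind, content, prev, _digest(step, kind, content, prev))
        self.events.append(event)
        return event

    def verify(self) -> bool:
        """Recompute the chain and report whether every entry still matches."""
        prev = ""
        for e in self.events:
            if e.prev_hash != prev or e.hash != _digest(e.step, e.kind, e.content, prev):
                return False
            prev = e.hash
        return True

    def seen_at(self, step: int) -> List[Event]:
        """Everything recorded up to and including a step, in order."""
        return [e for e in self.events if e.step <= step]
```

When an agent makes a strange call at step N, `seen_at(N)` is the reconstruction: what entered, what was compacted away, and what a sub-agent handed back before the decision.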

A related idea in circulation holds that the log is the agent: the durable, replayable record of everything the agent saw and did is closer to a system of record than any single live context window. Stated at that level, it follows directly from the reliability data. If the live window is small and lossy, the log is where the full history actually lives, and it is the only surface you can audit, replay, or hand to a reviewer.

This is why context engineering is a governance discipline, not a tuning trick. The same reasoning drove our argument that the 60% of AI work nobody governs is context engineering. The window you cannot inspect is the surface you cannot govern.

Why bloating the window backfires twice

Filling the window does more than waste budget. It actively degrades the run.

The first cost is attention dilution. Every distractor token competes with the signal tokens, and the model’s ability to retrieve the right fact drops as the irrelevant volume rises. The second cost is tool sprawl, which we traced in Code Mode and agent tool bloat: every tool definition, every verbose schema, every redundant result you leave in the context is paying rent in the one budget that determines output quality. A window stuffed with fifty tool specifications has less room for the actual problem, and the model has more ways to pick the wrong one.
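To make the rent visible, a team could break a context down by category before sending it. The sketch below assumes a hypothetical `rent_by_category` helper and the same crude four-characters-per-token estimate; real numbers would come from the model's own tokenizer.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def rent_by_category(context: Dict[str, List[str]]) -> Dict[str, float]:
    """Return each category's share of the context's total estimated tokens."""
    totals = {
        name: sum(estimate_tokens(part) for part in parts)
        for name, parts in context.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing in the context, so there are no shares to report
    return {name: count / grand_total for name, count in totals.items()}
```

If the share for tool definitions rivals or exceeds the share for the actual problem, the window is being spent on the menu rather than the meal.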

So the instinct to “give the model more so it has everything it might need” trades a small convenience for two compounding penalties. Less material, more precisely chosen, beats more material every time the working window is the constraint. And the working window is always the constraint.

Do this now

Stop budgeting against the number on the box. Pick the architecture that fits the work: compaction when the task is one long coherent thread and continuity matters more than parallelism, sub-agents when the work decomposes into bounded pieces that can run in isolation. Then add the layer neither pattern gives you for free. Turn on a reconstructable log of what entered the context at each step, what got compacted or delegated away, and what came back. Measure your effective utilization, not your theoretical capacity, and treat anything past the smart zone as a design smell rather than headroom.
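The difference between the two measurements is simple arithmetic. In this sketch the 100k smart zone is the approximate figure cited earlier, not a measured constant for any particular model, and `Utilization` is a hypothetical name.

```python
from dataclasses import dataclass

# Approximate working-zone figure from the discussion above; tune per model.
SMART_ZONE_TOKENS = 100_000


@dataclass
class Utilization:
    used: int                            # tokens in the context for this step
    advertised: int                      # the number on the box
    smart_zone: int = SMART_ZONE_TOKENS

    @property
    def theoretical(self) -> float:
        """Share of the advertised window in use."""
        return self.used / self.advertised

    @property
    def effective(self) -> float:
        """Share of the working zone in use; above 1.0 means past the zone."""
        return self.used / self.smart_zone

    @property
    def design_smell(self) -> bool:
        """True when a single step holds more than the working zone."""
        return self.used > self.smart_zone
```

A step holding 150k tokens in a 1M window reports a theoretical utilization of 15%, which looks like ample headroom, and an effective utilization of 150%, which is the number that matters.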

The window is a budget. Spend it on signal, keep the receipts, and architect so no single step ever has to hold more than it can actually attend to.

This analysis synthesizes Don’t trust large context windows (garrit.xyz, May 2026), The Oracle and the Firm (calv.info, June 2026), and Context Architecture (Nielsen Norman Group, June 2026).

Victorino Group helps teams design context architectures they can audit, not just bigger windows they cannot. Let’s talk.