Engineering Notes

The Context Crisis: Three Architecture Bets That Shrink the Agent's World to Make It Work

Thiago Victorino

Seventy-two percent. That is how much of a 200,000-token context window one engineering team burned on tool definitions before their agent processed a single user message. 143,000 tokens consumed by the protocol. 57,000 left for actual work.

We flagged this pattern three weeks ago. In The Operations Tax, we reported Kan Yilmaz’s finding that MCP dumps 15,540 tokens of tool schemas into the context window at session start. The number was already troubling. The new data from Apideck makes it worse: at 40 MCP tools, the upfront cost exceeds 55,000 tokens. Scale to the full catalog and you are paying rent on a house you cannot live in.

What changed between February and now is not just the numbers. Three independent engineering teams, working on unrelated problems, arrived at the same architectural conclusion. Their solutions look different on the surface. Underneath, they share a principle that rewrites how we should think about agent design.

The Apideck Numbers

Samir Amzani at Apideck ran a systematic benchmark comparing MCP tool calls against CLI equivalents. The results are stark.

A simple task (listing GitHub repositories) cost 44,026 tokens through MCP. The same task through CLI: 1,365 tokens. That is a 32x multiplier. For a moderately complex task, the ratio dropped to 4x, but MCP’s absolute cost remained high.

The token waste is only half the story. MCP’s GitHub Copilot server showed a 28% failure rate in Scalekit’s benchmark. Nearly one in three calls failed. The protocol that was supposed to standardize tool interaction is both expensive and unreliable at scale.

Amzani’s proposed alternative: replace MCP tool catalogs with CLI commands that use progressive disclosure. Instead of loading 55,000 tokens of schema upfront, the agent loads a single CLI entry point at roughly 80 tokens. It discovers subcommands on demand, paying 50 to 200 tokens per discovery step. The context budget flips from front-loaded to pay-as-you-go.
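The flipped budget is easy to see in a toy model. The sketch below is illustrative only: the `gh-cli` entry point, the help pages, and the four-characters-per-token estimate are all assumptions, not Apideck's actual tool surface or tokenizer.

```python
# Minimal sketch of progressive disclosure: the session starts with one
# small entry point, and the agent pays for subcommand help on demand.
# Catalog contents and token estimates here are illustrative.

ENTRY_POINT = "gh-cli: GitHub operations. Run 'gh-cli help <topic>' to discover subcommands."

HELP_PAGES = {
    "repo": "repo list | repo clone <name> | repo create <name> --private",
    "issue": "issue list --state open | issue create --title <t> --body <b>",
    "pr": "pr list | pr merge <number> --squash",
}

def context_cost(text: str) -> int:
    """Crude token estimate: roughly four characters per token."""
    return max(1, len(text) // 4)

class ProgressiveCLI:
    def __init__(self):
        # Only the entry point is in context at session start.
        self.loaded = [ENTRY_POINT]

    def discover(self, topic: str) -> str:
        """Load one subcommand's help page, pay-as-you-go."""
        page = HELP_PAGES[topic]
        self.loaded.append(page)
        return page

    def tokens_spent(self) -> int:
        return sum(context_cost(t) for t in self.loaded)

cli = ProgressiveCLI()
upfront = cli.tokens_spent()        # cost before any work happens
cli.discover("repo")                # the task needs repos, so load that page only
after_one = cli.tokens_spent()

# What a static catalog would have charged at session start:
full_dump = upfront + sum(context_cost(p) for p in HELP_PAGES.values())
print(upfront, after_one, full_dump)  # on-demand stays well below the full dump
```

The shape of the curve, not the toy numbers, is the point: a static catalog charges `full_dump` before the first user message, while progressive disclosure charges `upfront` and grows only with the tools the task actually touches.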

There is a governance dimension here that MCP Is Dead; Long Live MCP explored: CLI permission enforcement lives in binary code, not in prompt-based guardrails. A CLI either accepts a command or rejects it. There is no ambiguity for the model to exploit, no prompt injection vector in the tool description. The security boundary is structural.

The Manus Bet

A backend lead from Manus, the AI agent company, posted a blunt confession this month: the team stopped using function calling entirely. No typed tool catalog. No schema definitions. One tool. A single “run” command that accepts Unix-style commands.

This is the radical end of the spectrum. Where Apideck reduces the tool surface from 40 tools to a CLI with progressive disclosure, Manus reduces it to one tool. The agent composes behavior from small, chainable commands instead of selecting from a menu of predefined functions.

The reasoning is practical, not ideological. Large tool catalogs confuse agents. Each additional tool definition is another decision point, another source of hallucinated parameters, another opportunity for the model to pick the wrong tool for the job. Reducing the catalog to one tool eliminates tool selection errors entirely. The agent cannot pick the wrong tool when there is only one tool to pick.
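A minimal sketch of the single-tool pattern makes the tradeoff concrete. Manus has not published its implementation; the schema shape, the allowlist, and the dispatcher below are assumptions for illustration.

```python
# Sketch of a one-tool catalog: the entire tool surface is a single
# "run" command that accepts a Unix-style string. The allowlist is a
# structural guardrail enforced in code, not in a prompt.
import shlex
import subprocess

RUN_TOOL = {
    "name": "run",
    "description": "Execute a Unix-style command and return its output.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

ALLOWED = {"echo", "ls", "wc", "grep", "cat"}  # hypothetical policy

def run(command: str) -> str:
    """The agent's only tool: parse, check, execute, return text."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        return f"rejected: {argv[0] if argv else '(empty)'} not permitted"
    result = subprocess.run(argv, capture_output=True, text=True, timeout=10)
    return result.stdout or result.stderr

print(run("echo composability scales"))  # the agent chains calls like this
print(run("rm -rf /tmp/x"))              # refused in code, not in a prompt
```

Note that the rejection path doubles as the governance boundary the previous section described: the binary either runs the command or it does not, with no tool description for an injected prompt to subvert.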

This is the Unix philosophy applied to agent architecture: small tools, composable pipelines, text as the universal interface. It worked for operating systems in the 1970s. It appears to work for agents in 2026. The underlying reason is the same. Composability scales better than enumeration.

The tradeoff is real. A single-command interface sacrifices the structured validation that typed schemas provide. The agent must construct correct command strings from training data rather than filling in typed fields. For teams with strong testing infrastructure, that tradeoff pays off. For teams without it, the missing validation layer will surface as production failures.

The CursorBench Signal

Cursor’s evaluation problem looks unrelated to context windows. It is not.

Naman Jain published CursorBench this month, explaining why Cursor built a proprietary evaluation system instead of relying on public benchmarks. The core finding: public benchmarks are contaminated. OpenAI discovered that nearly 60% of unsolved benchmark problems had flawed test cases. The tests themselves were wrong.

When your evaluation tests are wrong, your model selection is ungoverned. You are choosing models based on scores that measure the wrong thing. This echoes what we documented in Half Your Benchmarks Are Wrong: the verification infrastructure itself becomes a liability when it cannot be trusted.

CursorBench solves this with what Jain calls “Cursor Blame,” an integration that sources evaluation tasks from real Cursor editing sessions. When a model’s suggestion is rejected or undone by a user, that interaction becomes a test case. The evaluation data comes from production behavior, not from static datasets that models have memorized.
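The mechanism can be sketched in a few lines. Cursor's actual session schema is not public; the event fields and task format below are invented for illustration, but they capture the described idea: rejected suggestions become evaluation cases, accepted ones do not.

```python
# Hypothetical sketch of mining eval tasks from editor sessions, in the
# spirit of "Cursor Blame": a suggestion the user rejected or undid
# becomes a test case grounded in real production context.
from dataclasses import dataclass

@dataclass
class EditEvent:
    file: str
    before: str          # the code as the user had it
    suggestion: str      # what the model proposed
    accepted: bool       # did the edit survive?

@dataclass
class EvalTask:
    file: str
    prompt_context: str  # the real file state shown to candidate models
    failing_output: str  # the suggestion users rejected

def mine_tasks(session: list[EditEvent]) -> list[EvalTask]:
    """Rejections become test cases; accepted edits are skipped."""
    return [
        EvalTask(e.file, e.before, e.suggestion)
        for e in session
        if not e.accepted
    ]

session = [
    EditEvent("api.py", "def get(x):", "def get(x, y):", accepted=False),
    EditEvent("api.py", "return x", "return x or 0", accepted=True),
]
tasks = mine_tasks(session)
print(len(tasks))  # only the rejected suggestion becomes an eval task
```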

The connection to context architecture is indirect but important. CursorBench produces greater model separation at the frontier, precisely where public benchmarks plateau. Models that score identically on HumanEval or SWE-bench show measurable differences on CursorBench tasks. The reason: real-world coding tasks require the model to work within tight context constraints. Parse the relevant file. Ignore the irrelevant ones. Produce a precise edit without rewriting the entire function.

The models that win on CursorBench are the ones that use context efficiently. Not the ones that consume the most.

The Convergence

Three teams. Three different problems. One principle.

Apideck solved token waste by replacing upfront schema loading with progressive CLI disclosure. Manus solved tool confusion by collapsing the entire tool catalog to a single command. Cursor solved benchmark contamination by sourcing evaluation from real context-constrained editing sessions.

Each solution shrinks the agent’s world. Fewer tools loaded at once. Fewer decisions to make. Tighter context constraints. The agent performs better not despite the smaller surface area but because of it.

This is counterintuitive. The natural instinct in platform engineering is to give agents more: more tools, more context, more capability. The evidence from all three teams points the other way. Agents are not humans who benefit from having a full toolbox within arm’s reach. Agents are statistical systems that degrade as the decision space expands.

Every token of tool definition competing for space in the context window is a token not available for reasoning about the actual task. Every additional tool in the catalog is another branch in the decision tree the model must navigate. The cost is not just computational. It is cognitive, in the machine-learning sense. The model’s attention is a finite resource, and tool definitions consume it.

What This Means for Architecture

The practical implications split into three levels.

For individual developers: the CLI-first approach is already the right default. If you are running agents against local codebases, skip the MCP catalog. Use the tools your agent already knows from training data. Pay for context on demand, not upfront.

For platform teams: audit your tool surface area. The question is not “what tools could the agent use?” but “what tools must the agent use for this specific task?” Dynamic tool loading, whether through Anthropic’s Tool Search or through CLI progressive disclosure, should be the standard pattern. Static catalogs that dump everything at session start are an architecture smell.

For organizations: context efficiency is becoming a proxy for agent quality. The CursorBench finding matters here. Models that use context well outperform models that consume it indiscriminately. The same principle applies to the architectures you build around those models. An agent system that loads 40 tools for a task requiring three is not more capable. It is more confused.

The governance angle is worth restating. As we explored in Context Is the New Perimeter, the context window is a boundary. What enters that boundary determines what the agent can do, what it knows, and where it can go wrong. Shrinking the tool surface is not just an optimization. It is a security decision. Fewer tools means fewer attack vectors, fewer hallucination triggers, fewer pathways to unintended behavior.

The Question That Remains

The three solutions described here optimize for different constraints and accept different tradeoffs. Apideck preserves discoverability while reducing upfront cost. Manus maximizes simplicity at the expense of structured validation. Cursor optimizes evaluation fidelity by grounding it in production data.

None of them is universally correct. The right approach depends on your team’s testing maturity, your governance requirements, and whether you are building for individual developers or for organizational deployment. That distinction, as MCP Is Dead; Long Live MCP argued, remains the fault line in every agent architecture debate.

What all three share is the recognition that more is not better. The agent’s world should be as small as the task requires. Not smaller. Not larger. Right-sized.

The context window is not unlimited storage. It is a budget. Spend it on the work.


This analysis synthesizes Your MCP Server Is Eating Your Context Window (March 2026), How We Compare Model Quality in Cursor (March 2026), and insights from the Manus engineering team.

Victorino Group designs agent architectures that maximize capability while minimizing context waste. Let’s talk.
