Passive Context Wins: Why AGENTS.md Outperforms Skills in AI Agent Evals
Vercel ran an experiment. They tested different documentation strategies for AI coding agents working with Next.js 16 APIs. The results challenge a common assumption about how we should feed information to agents.
The numbers tell a clear story:
| Approach | Pass Rate |
|---|---|
| Baseline (no documentation) | 53% |
| Skills with default behavior | 53% |
| Skills with explicit instructions | 79% |
| AGENTS.md documentation index | 100% |
With default behavior, the skill-based approach performed identically to having no documentation at all. Why?
The Real Problem: Skills Never Get Called
In 56% of evaluation cases, the agent never invoked the skill it needed. The information existed. The mechanism worked. But the agent did not decide to use it.
This is the activation problem. When information requires a decision to access, agents frequently fail to make that decision. Not because they lack capability, but because choosing to retrieve context is itself a reasoning step that can fail.
Think about it: the agent must first recognize it needs information, then decide which skill to invoke, then parse the response, then continue reasoning. Each step is a potential failure point. In production, these failure points compound.
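To see how quickly this compounds, run the arithmetic. The per-step reliability below is an assumed illustrative figure, not a number from Vercel's experiment:

```ts
// Illustrative arithmetic: assume each step in the retrieval chain
// succeeds independently with probability 0.9 (an assumption for
// illustration, not a measured value).
const steps = [
  "recognize that information is needed",
  "choose the correct skill to invoke",
  "parse the skill's response",
  "continue reasoning with the result",
];

const perStepReliability = 0.9;
const endToEnd = perStepReliability ** steps.length;

console.log(endToEnd.toFixed(2)); // "0.66": four 90%-reliable steps pass only ~66% of the time
```

Passive context collapses that chain to zero decisions.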
Why Passive Context Wins
AGENTS.md takes a different approach. Instead of requiring the agent to actively retrieve documentation, it embeds a compressed index directly into the system prompt. The information is present every turn.
Three characteristics make this effective:
No activation decision required. The agent does not need to recognize it needs help. The context is already there. This eliminates an entire category of failure.
Consistent accessibility. Whether the task is simple or complex, whether the agent is on turn two or turn twenty, the documentation index remains in the prompt. There is no variability in access.
No sequencing problems. With skills, agents face a dilemma: explore the project first, then read docs? Or read docs first, then explore? This creates timing failures where agents dive into implementation before understanding the APIs. Passive context sidesteps this entirely.
The Compression Problem
Raw documentation cannot fit in system prompts. The Next.js docs Vercel tested were 40KB. They compressed them to 8KB — an 80% reduction — using a pipe-delimited structure.
The target was Next.js 16 APIs that exist outside model training data:
- `connection()`
- `'use cache'` directive
- `cacheLife()`
- `cacheTag()`
- `forbidden()`
- `unauthorized()`
- async `cookies()` and `headers()`
These APIs are new. Models cannot rely on pre-trained knowledge. Documentation is essential.
The compression strategy matters. Pipe-delimited formats are denser than markdown. Every token counts when you are competing for space in a finite context window.
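As a sketch of what that density can look like (a hypothetical rendering, not Vercel's published format; the field order is my assumption), one pipe-delimited row can replace a paragraph of markdown prose:

```text
# Hypothetical index rows: api|module|constraint|effect (format assumed for illustration)
connection()|next/server|await it before reading request-time values|opts the route into dynamic rendering
cacheLife()|next/cache|call inside a 'use cache' scope|sets the revalidation profile for the cached entry
cacheTag()|next/cache|call inside a 'use cache' scope|tags the entry for later revalidateTag() invalidation
```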
Anthropic’s Context Engineering Principles
This aligns with how Anthropic frames context engineering: finding the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome.
They distinguish two approaches:
Pre-computation (passive): Embedding-based retrieval surfaces all relevant information upfront. The agent starts with everything it needs. Cost: larger prompts. Benefit: no retrieval failures.
Just-in-time (active): Agents maintain lightweight identifiers and dynamically retrieve details when needed. Cost: retrieval can fail. Benefit: smaller prompts.
Claude Code uses a hybrid model. CLAUDE.md loads immediately into context. But the agent also has grep and glob tools for runtime exploration. The baseline context is always present; additional details are fetched on demand.
The Vercel experiment suggests the balance should tilt toward pre-computation, at least for critical documentation. The cost of always-present context is lower than the cost of retrieval failures.
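A minimal sketch of tilting toward pre-computation, assuming a generic chat-completion setup (`buildSystemPrompt` and the file path are placeholders, not a specific SDK's API):

```ts
import { readFileSync } from "node:fs";

// The compressed index (~8KB) is read once and attached to every request.
const docsIndex = readFileSync("AGENTS.md", "utf8");

function buildSystemPrompt(basePrompt: string): string {
  // Appended unconditionally: no tool call, no activation decision,
  // no way for the agent to skip the documentation.
  return `${basePrompt}\n\n## Next.js 16 documentation index\n${docsIndex}`;
}

const systemPrompt = buildSystemPrompt(
  "You are a coding agent working in a Next.js 16 repository."
);
// Pass systemPrompt to your model client; grep/glob-style tools can still
// fetch full documents on demand, mirroring the hybrid model described above.
```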
Practical Recommendations
If you are building systems that rely on AI agents understanding documentation:
Do not rely on skills as your primary delivery mechanism. Skills work when agents decide to invoke them. In 56% of cases, they do not. That is not an acceptable failure rate for production systems.
Compress aggressively. The difference between 40KB and 8KB is the difference between fitting in context and not fitting. Use structured formats. Remove prose. Keep signal.
Build evaluation suites targeting APIs outside training data. If your agent appears to know something without documentation, it might be relying on training data that will become stale. Test with new APIs to validate your documentation strategy.
Structure documentation for direct retrieval rather than upfront loading. When you do need active retrieval, make the target clear. File paths, not descriptions. The agent should know exactly what to fetch.
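For instance (hypothetical entries with illustrative paths), contrast a topic description with a fetchable target:

```text
# Vague: the agent must guess where to look
caching|covers the new caching directives and lifecycle helpers

# Direct: the agent knows exactly what to fetch
'use cache'|docs/api-reference/directives/use-cache.mdx
cacheLife()|docs/api-reference/functions/cacheLife.mdx
```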
Consider the activation tax. Every time you require an agent to decide to retrieve information, you pay a tax in reliability. Sometimes that tax is worth paying. Often, it is not.
Implementation
Vercel provides a codemod to generate AGENTS.md for Next.js projects:
```bash
npx @next/codemod@canary agents-md
```
This creates a compressed documentation index specific to Next.js 16 APIs. The approach generalizes: any framework-specific knowledge that agents need reliably should live in passive context, not behind a skill invocation.
The Broader Lesson
The skill-based approach feels elegant. Define capabilities. Let the agent choose. Trust its judgment.
But agent judgment has limits. Recognizing when to use a skill requires meta-cognition that current models handle inconsistently. The activation decision is itself a reasoning step, and reasoning steps fail.
Passive context is less elegant but more reliable. The information is always there. No decision required. No failure mode around “should I look this up?”
This maps to a broader principle in system design: reduce optionality at points where optionality introduces failure. Give the agent fewer choices to make, and it makes fewer mistakes.
The 100% vs 53% comparison is stark. When documentation strategy alone nearly doubles your pass rate, you have found a leverage point worth optimizing.
At Victorino Group, we design AI agent systems that work reliably in production. Context strategy is one of the levers that determines whether agents succeed or fail. If you are building with AI agents and reliability matters, let’s talk.