Engineering Notes

Seeing Like an Agent: What Claude Code's Tool Design Reveals About Building AI Systems

Thiago Victorino

Thariq, an engineer on Anthropic’s Claude Code team, published something unusual in March 2026: a candid account of four tool design iterations that failed before they worked. No benchmarks. No marketing framing. Just the sequence of decisions and what each taught the team about how models interact with tools.

This matters because we have been citing external studies about agent tool design for months. Azure SRE, Vercel evaluations, academic papers. Now the team that builds the tools is showing the iteration history. The evidence is first-party.

The four lessons cover elicitation (asking users questions), task management, search, and progressive disclosure. Each one follows the same arc: the team designed a tool that made sense to humans, discovered the model interacted with it differently than expected, then redesigned around the model’s actual behavior.

That arc is the lesson. Not the specific tools.

Lesson 1: Tools Are Behavioral Instructions

The Claude Code team needed a way for the agent to ask users clarifying questions. They tried three approaches.

First: an ExitPlanTool parameter that combined a plan with questions. The model treated plan generation and question-asking as conflicting objectives. It produced confused outputs.

Second: a modified markdown output format that asked for structured questions. The model added extra sentences, omitted options, and broke the format inconsistently.

Third: a dedicated AskUserQuestion tool with explicit structure (question text, options, follow-up behavior). This worked. Not because the tool was technically superior. Because the interface aligned with how the model represents the task internally.
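What "explicit structure" means in practice can be sketched as a tool definition in the JSON-schema style Anthropic's API uses for tools. The field names below (`question`, `options`, `allow_follow_up`) are illustrative assumptions, not Claude Code's actual schema:

```python
# Hypothetical sketch of a dedicated elicitation tool. The point is the
# structure: the model fills typed slots instead of improvising a format.
ask_user_question = {
    "name": "AskUserQuestion",
    "description": (
        "Ask the user one clarifying question before proceeding. "
        "Offer concrete options so the user can answer quickly."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "The question text.",
            },
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "description": "2-4 concrete answers the user can pick from.",
            },
            "allow_follow_up": {
                "type": "boolean",
                "description": "Whether the agent may ask a follow-up.",
            },
        },
        "required": ["question", "options"],
    },
}
```

Compare this with the second attempt: a markdown format asks the model to comply with a convention, while a schema makes compliance the only way to call the tool at all.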

Thariq’s phrasing: “Even the best designed tool doesn’t work if Claude doesn’t understand how to call it.”

We wrote about this principle from the outside in Context Is the New Perimeter, where we analyzed how tool descriptions function as behavioral instructions. The Claude Code team’s iteration history is direct confirmation. Tool design is not API design. It is behavioral design. The consumer is not a developer reading documentation. It is a model parsing structured input.

As we documented in AI Agents Have Opinions, analysis of 2,430 Claude responses showed 90% agreement on tool picks across model versions. The AskUserQuestion evolution explains the mechanism: when tool interfaces align with model behavior, selection becomes deterministic. When they conflict, output quality degrades regardless of the model’s capability.

Lesson 2: Good Tools Become Bad Tools

The second lesson is more disorienting. Tools that work well today may constrain the model tomorrow.

Claude Code originally used TodoWrite for task tracking: a simple list that the agent updated as it worked. System reminders every five turns kept the list visible. This worked.

Then models improved. Opus 4.5 brought better multi-step reasoning. Suddenly the reminders became a problem. Instead of adapting its plan to new information, the model would stick rigidly to the existing list because the system kept reasserting it. The tool had become a constraint.

The problem deepened when subagents entered the picture. Multiple agents needed to coordinate on shared tasks. TodoWrite was designed for single-agent use. There was no mechanism for one agent to see another’s progress, declare dependencies, or modify shared state.

The replacement, a Task Tool with dependencies, shared updates, and the ability to alter or delete tasks, solved both problems. But the interesting part is not the solution. It is the diagnosis: a well-designed tool became a bottleneck because the model outgrew it.
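The two capabilities TodoWrite lacked, dependencies and shared state, can be sketched in a few lines. The class and method names here are illustrative assumptions, not Claude Code's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    id: str
    description: str
    depends_on: list = field(default_factory=list)
    status: str = "pending"  # pending | done


class TaskStore:
    """Shared across subagents: any agent can read, update, or delete tasks."""

    def __init__(self):
        self.tasks = {}

    def add(self, task: Task):
        self.tasks[task.id] = task

    def ready(self):
        # A task is runnable only once all its dependencies are done,
        # which is the coordination TodoWrite's flat list could not express.
        return [
            t for t in self.tasks.values()
            if t.status == "pending"
            and all(self.tasks[d].status == "done" for d in t.depends_on)
        ]

    def complete(self, task_id: str):
        self.tasks[task_id].status = "done"


store = TaskStore()
store.add(Task("a", "write migration"))
store.add(Task("b", "run tests", depends_on=["a"]))
print([t.id for t in store.ready()])  # only "a" is unblocked
store.complete("a")
print([t.id for t in store.ready()])  # now "b" is unblocked
```

Because the store is shared rather than owned by one agent, a subagent completing "a" immediately unblocks "b" for any other agent polling the store.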

Thariq: “As model capabilities increase, the tools that your models once needed might now be constraining them.”

This has a direct governance implication. Tool inventories need version reviews, the same way security teams review access permissions. A tool that correctly bounded agent behavior six months ago may now be preventing the model from using capabilities it has gained since. The review cadence matters.

Lesson 3: Found Context Beats Given Context

The search evolution is the most technically revealing of the four lessons.

The team started with RAG (retrieval-augmented generation). It worked but required indexing infrastructure, broke when repositories changed, and produced a specific failure mode: context was given to the model rather than found by it. The model received relevant documents but lacked the search process that would have helped it understand why those documents were relevant.

They switched to Grep. The model searched for what it needed and built its own context incrementally. Performance improved. Then Skills formalized this pattern into progressive disclosure: the model starts with minimal context, searches for more when needed, and can perform recursive file reading and nested search across several layers.
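The difference between given and found context can be made concrete with a toy sketch. Instead of receiving pre-retrieved documents, the agent searches, reads only what matched, and recurses on symbols it discovers. The in-memory "repo" and the single import-following heuristic are simplifying assumptions:

```python
import re

# Toy stand-in for a file tree the agent can search.
repo = {
    "auth.py": "from tokens import issue_token\ndef login(user): return issue_token(user)",
    "tokens.py": "def issue_token(user): return f'tok-{user}'",
    "docs.md": "Login flow lives in auth.py",
}


def grep(pattern):
    """One broad search primitive, instead of many narrow retrieval tools."""
    return [f for f, text in repo.items() if re.search(pattern, text)]


def build_context(symbol, seen=None):
    """Progressive disclosure: start minimal, follow references as needed."""
    seen = seen if seen is not None else set()
    for f in grep(symbol):
        if f in seen:
            continue
        seen.add(f)
        # Recurse on imported names found in the file: nested search
        # across layers, driven by what the agent just learned.
        for name in re.findall(r"import (\w+)", repo[f]):
            build_context(name, seen)
    return seen


print(sorted(build_context("login")))  # → ['auth.py', 'tokens.py']
```

The search trace itself is the payoff: the model knows `tokens.py` is in context because `auth.py` imports from it, which is exactly the "why is this relevant" signal RAG injection discards.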

Thariq: “Over a year, Claude went from not building its own context to nested search across several layers.”

This confirms something we explored in Context Engineering for AI Agents: Azure’s SRE team reached the same conclusion from the opposite direction. They started with 100+ narrow tools, found that “we hadn’t built an agent, we’d built a workflow with an LLM stapled on,” and collapsed to approximately five broad CLI tools. Both teams converged on the same principle. Agents perform better when they find context than when they receive it.

Anthropic’s own engineering blog quantified part of this. Their Tool Search mechanism reduced context by 85% while improving Opus 4 accuracy from 49% to 74%. Programmatic tool calling added 37% token savings. These numbers validate the architecture, not just the intuition.
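The idea behind that Tool Search mechanism can be sketched simply: only a small meta-tool sits in context, and full tool definitions are fetched on demand. The registry and the word-overlap scoring below are illustrative assumptions, not Anthropic's implementation:

```python
# Full definitions live outside the prompt; only search_tools is visible.
TOOL_REGISTRY = {
    "deploy_service": "Deploy a service to the target environment.",
    "rollback_release": "Roll back the most recent release.",
    "query_metrics": "Query time-series metrics for a service.",
}


def search_tools(query: str, top_k: int = 2):
    """Return the top_k tool names whose descriptions best match the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        TOOL_REGISTRY,
        key=lambda name: -len(query_words & set(TOOL_REGISTRY[name].lower().split())),
    )
    return scored[:top_k]


print(search_tools("roll back a bad release"))
# → ['rollback_release', 'deploy_service']
```

The context saving comes from the asymmetry: the prompt carries one short meta-tool instead of the full registry, and the model pays for a definition only when it asks for it.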

Lesson 4: Expanding Capability Without Expanding Tools

Claude Code has approximately 20 tools. Adding a new one is expensive: it increases the action space, risks confusing the model, and invalidates cached prompts.

The team needed Claude Code to understand itself (its own features, commands, and capabilities) but adding a documentation tool would expand the visible toolset. Loading docs into the system prompt would cause context rot. Linking to external documentation returned too many results.

Their solution: a Guide Agent. A dedicated subagent with instructions on how to search Claude Code’s own documentation. No new tool added to the main agent’s interface. The capability was available through the existing subagent mechanism.
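The pattern can be sketched as routing through a generic dispatch point rather than a new top-level tool. The dispatcher and agent names below are illustrative assumptions:

```python
# The main agent's tool list contains one generic "run subagent" entry.
# New capabilities register here, so the visible action space never grows.
SUBAGENTS = {
    "guide": lambda prompt: f"[docs search] {prompt}",
    "coder": lambda prompt: f"[code change] {prompt}",
}


def run_subagent(name: str, prompt: str) -> str:
    """Stand-in for a real subagent call (e.g. via an agent SDK)."""
    return SUBAGENTS[name](prompt)


# "How do slash commands work?" routes to the Guide Agent through the
# existing mechanism; no new tool, no cache invalidation.
print(run_subagent("guide", "how do slash commands work?"))
```

Adding a third subagent changes the registry, not the interface the main model evaluates at each decision point.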

Thariq: “We were able to add things to Claude’s action space without adding a tool.”

This is progressive disclosure applied to the agent itself. The ToolChain* paper (2024) formalizes this as a tree search problem: navigating action spaces where not all tools are visible at any node. Claude Code’s Guide Agent is a practical implementation of that theoretical framework.

We covered the governance implications of skills-as-progressive-disclosure in Skills Are Not Replacing Agents. The Claude Code team’s approach confirms the pattern: skills and subagents let you expand what an agent can do without expanding what it must evaluate at each decision point.

What the Thread Does Not Say

The four lessons are genuinely useful. They also have limits that practitioners should recognize.

No quantitative outcomes. The thread describes iteration sequences but provides no task completion rates, A/B test results, or error rate comparisons between tool versions. Azure’s SRE team, working on a comparable problem, published that handoffs beyond four had near-total failure rates and that broad tools outperformed narrow ones by measurable margins. The Claude Code team’s account is qualitative where it could be quantitative.

Progressive disclosure has contradictory evidence. Vercel’s evaluation in January 2026 found that Skills (Anthropic’s progressive disclosure mechanism) activated only 53% of the time with default settings, identical to having no documentation at all. Passive context via AGENTS.md files hit 100% pass rates. With explicit instructions, Skills improved to 79%. This does not invalidate progressive disclosure, but it complicates the narrative. Found context may beat given context in some configurations while losing in others.

“Art not science” is an interim position. Thariq frames tool design as craft-based intuition. The agent evaluation field (MCPVerse benchmarks, AgentIF framework, FeatureBench) is making tool design empirical. The craft framing is honest about where the field is today. It should not be mistaken for where the field will stay.

Commercial interest is present. Every design decision in the thread doubles as product marketing for Claude Code and the Agent SDK. This does not make the lessons wrong. It means the selection of which lessons to share was not random. The team chose stories that showcase their product’s architecture.

The Governing Principle

Across all four lessons, one principle recurs: tools govern agent behavior more reliably than instructions.

Explicit instructions (“always ask clarifying questions before proceeding”) produce inconsistent compliance. A dedicated tool with structured parameters produces consistent behavior. System prompt reminders (“check your task list regularly”) caused rigidity. A Task Tool with dependency management enabled coordination. RAG-injected context degraded over time. Search tools that let the model find its own context improved over time.

This is the principle we articulated in Agents Are Not Tools: the architecture of the interface defines the behavior of the system. The Claude Code team’s iteration history is a case study of that principle applied four times, failing first, then succeeding through redesign.

For organizations building agent systems, the implication is concrete. Invest less time writing instructions. Invest more time designing tool interfaces. The tool is the instruction.

And review those tools regularly. Today’s well-designed interface is tomorrow’s constraint.


This analysis synthesizes Thariq’s “Lessons from Building Claude Code: Seeing like an Agent” (Mar 2026), Microsoft’s Azure SRE Agent case study (Jan 2026), Vercel’s agent evaluation results (Jan 2026), Anthropic’s advanced tool use documentation, and the ToolChain* paper on action space navigation (2024).

Victorino Group helps organizations design governed AI agent systems that scale. Let’s talk.
