Engineering Notes

What Claude Code's /insights Reveals About Measuring AI-Assisted Development

Thiago Victorino
8 min read

Most teams measure AI tool usage by counting tokens spent. This is like measuring a developer’s productivity by counting keystrokes. It tells you something happened. It tells you nothing about what.

Anthropic’s Claude Code ships with a native command called /insights that takes a different approach. Instead of quantitative metrics, it performs qualitative analysis of your sessions using an LLM. Run it, and you get an interactive HTML report about how you actually use the tool — what works, what fails, where friction lives.

The architecture behind this command is worth studying. Not because everyone needs to build their own /insights, but because the pipeline design reveals how to build observability into agentic systems — a problem every engineering team adopting AI agents will face.

The 6-Stage Pipeline

The /insights command processes your usage data through six sequential stages:

1. Filtering. Not all sessions are worth analyzing. The pipeline removes agent sub-sessions (spawned by the main session), internal facet-extraction sessions, sessions with fewer than 2 messages, and sessions shorter than 1 minute. This is a deliberate choice: noise reduction before analysis, not after.

2. Summarization. Session transcripts that exceed 30,000 characters are chunked into 25,000-character segments and summarized. This handles the reality that some coding sessions are long, rambling affairs. The summarization preserves intent while discarding verbatim transcripts.

3. Extraction. Each session is analyzed for structured facets: 13 goal categories (debugging, refactoring, greenfield development, etc.), 12 friction types (context loss, tool failures, unclear instructions, etc.), and user satisfaction signals. The output is structured JSON, not free text. This is the step where qualitative observation becomes quantifiable data.

4. Aggregation. Seven separate aggregation prompts synthesize the extracted facets across all sessions. These prompts cover: project areas you work on, your interaction style, what workflows succeed, where friction concentrates, improvement suggestions, and time horizons (short-term patterns vs. long-term trends).

5. Summary. An executive-level synthesis produces an “at a glance” view: what is working, what is hindering you, quick wins you could implement, and ambitious workflows worth attempting.

6. Rendering. Everything is assembled into a self-contained HTML report you can open in a browser. No server, no dashboard, no telemetry. A local file.
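As a rough sketch, the six stages compose into a single linear flow. This is an illustrative reconstruction, not Anthropic's implementation: only the thresholds (2 messages, 1 minute, 30,000/25,000 characters, 50 sessions) come from the description above, while the session shape, the `llm` callable, and the seven aggregation dimensions are assumptions.

```python
def render_html(overview, aggregates):
    """Stage 6 stub: assemble a self-contained local HTML report."""
    body = "".join(f"<section><h2>{k}</h2><p>{v}</p></section>"
                   for k, v in aggregates.items())
    return f"<html><body><h1>{overview}</h1>{body}</body></html>"

def run_insights(sessions, llm):
    """Illustrative 6-stage flow; `llm` stands in for calls to a small model."""
    kept = [s for s in sessions                        # 1. filter noise first,
            if not s["is_subagent"]                    #    then cap at 50
            and s["messages"] >= 2
            and s["seconds"] >= 60][:50]
    for s in kept:                                     # 2. summarize long transcripts
        if len(s["transcript"]) > 30_000:
            chunks = [s["transcript"][i:i + 25_000]
                      for i in range(0, len(s["transcript"]), 25_000)]
            s["transcript"] = "\n".join(llm("summarize", c) for c in chunks)
    facets = [llm("extract_facets", s["transcript"]) for s in kept]   # 3. extract
    aggregates = {dim: llm(f"aggregate_{dim}", facets)                # 4. aggregate
                  for dim in ("projects", "style", "wins", "friction",
                              "suggestions", "short_term", "long_term")}
    overview = llm("executive_summary", aggregates)                   # 5. summary
    return render_html(overview, aggregates)                          # 6. render
```

Note that the expensive qualitative work (stages 3 through 5) only ever sees sessions that survived stage 1, which is the point the next section makes in detail.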

Technical Constraints Worth Noting

The pipeline uses Haiku (Anthropic’s smallest, fastest model) with a max-token budget of 8,192 per call. It processes up to 50 sessions per run. All data stays local, stored in ~/.claude/usage-data/ and never transmitted.

These are not arbitrary numbers. They reflect engineering trade-offs:

Haiku over Opus or Sonnet. A meta-analysis tool should not cost more than the work it analyzes. Using the cheapest model for introspection keeps the feedback loop economically sustainable. You want engineers running /insights weekly, not rationing it.

50 sessions per run. Enough to surface patterns, not so many that the analysis takes forever or the aggregation prompts lose coherence. This is a context window management decision disguised as a product constraint.

100% local. This is the single most important design decision. Usage analysis involves your actual work — codebases, error messages, conversations about architecture. Sending this to a remote dashboard is a non-starter for security-conscious teams. The local-only design makes the feature usable in environments where cloud telemetry would be blocked.

What the Architecture Teaches

1. Filter Before You Analyze

The first stage is removal. Most observability systems do the opposite — they collect everything and filter at query time. The /insights approach is better for LLM-based analysis because every token in the context window matters. Sub-sessions, trivial interactions, and noise degrade the quality of downstream extraction and aggregation.

If you are building observability for your own agentic workflows, filter aggressively at ingestion. Define what a “real session” is. Discard the rest before your analysis model sees it.
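In code, "define what a real session is" means a single predicate applied at ingestion. The criteria below mirror the published /insights filters; the field names and the dict-based session shape are assumptions for illustration.

```python
def is_real_session(session: dict) -> bool:
    """Ingestion-time filter: define 'real' once, apply before any analysis.
    Criteria mirror the /insights filters; field names are assumptions."""
    return (
        not session.get("is_subagent", False)       # spawned sub-sessions
        and not session.get("is_facet_run", False)  # internal extraction runs
        and session.get("message_count", 0) >= 2    # trivial exchanges
        and session.get("duration_s", 0) >= 60      # sessions under a minute
    )

def ingest(raw_sessions):
    """Only surviving sessions are ever stored for analysis."""
    return [s for s in raw_sessions if is_real_session(s)]
```

Centralizing the definition in one predicate also means the team can argue about it, version it, and tighten it over time, which is harder when filtering logic is scattered across query-time code.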

2. Structured Extraction Is the Hard Part

Turning free-form conversation transcripts into structured JSON with 13 goal categories and 12 friction types is the core intellectual work of this pipeline. The filtering, summarization, and aggregation stages are infrastructure. The extraction stage is where the model of understanding lives.

This maps directly to a broader pattern in agentic systems: the value is in the taxonomy. Anyone can build a pipeline. The teams that win are the ones who define precise categories for what their agents do, where they fail, and why users are satisfied or frustrated.
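One way to make a taxonomy do real work is to make it executable: enumerate the categories as a closed set and validate every extraction against it, so an out-of-vocabulary label is an error rather than silently becoming data. The category names below are illustrative placeholders; Anthropic's actual 13 goal and 12 friction categories are not reproduced here.

```python
# Illustrative closed vocabularies, NOT the real /insights category lists.
GOALS = {"debugging", "refactoring", "greenfield", "code_review", "testing"}
FRICTIONS = {"context_loss", "tool_failure", "unclear_instructions", "scope_creep"}

def validate_facets(extracted: dict) -> dict:
    """Force LLM output into the taxonomy: unknown labels raise, not pass."""
    bad_goals = set(extracted.get("goals", [])) - GOALS
    bad_frictions = set(extracted.get("frictions", [])) - FRICTIONS
    if bad_goals or bad_frictions:
        raise ValueError(f"outside taxonomy: {bad_goals | bad_frictions}")
    return extracted
```

Rejecting free-text labels at this boundary is what lets the aggregation stages count and compare facets across sessions at all.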

3. Multiple Aggregation Prompts Beat a Single Summary

Seven separate aggregation prompts, each focused on a different dimension, produce richer analysis than a single prompt asking for everything. This is prompt engineering applied to analytics: decompose the question, synthesize the answers.

Teams building internal AI analytics should steal this pattern. Do not ask one prompt to analyze everything. Ask seven focused prompts and compose the results.
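Decompose-then-compose can be as simple as a dictionary of focused prompts fanned out over the same extracted facts. The seven dimensions below paraphrase the ones listed earlier; the exact prompt wording and the `ask` callable are assumptions.

```python
# One narrow question per dimension, rather than one prompt asking for everything.
AGGREGATION_PROMPTS = {
    "project_areas": "Which project areas do these sessions concentrate on?",
    "interaction_style": "How does this user phrase requests and give feedback?",
    "successful_workflows": "Which workflows consistently reached their goal?",
    "friction_hotspots": "Where does friction concentrate, and of what type?",
    "suggestions": "What concrete workflow changes would help most?",
    "short_term": "What patterns appear in the most recent sessions?",
    "long_term": "What trends persist across the whole window?",
}

def aggregate(facets, ask):
    """Fan out one focused question per dimension, then compose the answers."""
    return {dim: ask(prompt, facets)
            for dim, prompt in AGGREGATION_PROMPTS.items()}
```

Each focused answer is shorter and easier to verify than one sprawling analysis, and a weak answer on one dimension does not drag down the other six.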

4. The Feedback Loop Is the Product

The real value of /insights is not the report. It is the cycle: use Claude Code naturally, run /insights, update your CLAUDE.md configuration based on the findings, adopt suggested features you were not using. The report is an artifact. The behavior change is the product.

This is the same principle that makes retrospectives valuable in software teams — but automated and based on data rather than memory.

The Broader Ecosystem

/insights does not exist in isolation. Claude Code also ships /reflect, which provides per-session reflection immediately after a session ends. And the open-source community has built tools like claude-reflect (by BayramAnnakov), which auto-captures reflection data for every session.

The three tools form a hierarchy:

  • claude-reflect: automatic, per-session, raw capture
  • /reflect: manual, per-session, immediate analysis
  • /insights: periodic, cross-session, pattern recognition

This mirrors how good engineering organizations think about monitoring: you need real-time alerts (reflect), periodic reviews (insights), and continuous background capture (claude-reflect) to build a complete picture.

Critical Perspective

Several limitations are worth acknowledging.

Model quality ceiling. Using Haiku for analysis means the extraction and aggregation are limited by the smallest model’s reasoning capacity. Complex patterns in how you use Claude Code may be missed because Haiku cannot detect them. The cost trade-off makes sense, but it is a trade-off.

50-session window. If your usage patterns shift over months, the 50-session window may miss long-term trends. There is no longitudinal view that compares your January patterns to your June patterns.

No team-level analysis. The tool is strictly individual. For engineering leaders trying to understand how their team uses AI tools — where the real governance questions live — /insights offers no aggregation across developers. Building that layer while preserving privacy would be valuable.

Self-referential bias. An Anthropic model analyzing usage of an Anthropic tool has an inherent bias toward recommending more usage of Anthropic features. The suggestions section should be read with this in mind.

No comparison baseline. The report tells you what happened, but not how it compares to other developers, other projects, or your own past performance. Patterns without baselines are observations, not insights.

Practical Implications for Engineering Teams

If you are running a team that uses Claude Code or any AI coding assistant, three things follow from this analysis:

Build your own feedback loop. Even if you do not use /insights, the pattern matters. Your team needs a mechanism for periodically reviewing how they interact with AI tools and adjusting their workflows. Without this, engineers develop habits — good and bad — that nobody examines.

Define your taxonomy. The 13 goal categories and 12 friction types in /insights are Anthropic’s model of how developers use their tool. Your organization’s categories will differ. Define them. What are the 5 things your team uses AI agents for? What are the 5 ways those interactions fail? Without a taxonomy, you cannot measure, and without measurement, you cannot improve.

Instrument your agents. If you are building custom agents using Claude, GPT, or any other model, build observability in from the start. Log sessions. Classify outcomes. Create periodic summaries. The hardest part of governing AI in production is not preventing failures — it is understanding how the tools are actually being used.
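A minimal starting point for that instrumentation is an append-only log of classified sessions plus the simplest possible periodic summary. The schema, field names, and file location below are illustrative, not any tool's actual format.

```python
import json
import time
from pathlib import Path

LOG = Path("agent_sessions.jsonl")  # illustrative location

def log_session(goal: str, outcome: str, notes: str = "") -> dict:
    """Append one classified session record; summaries read this file later."""
    record = {
        "ts": time.time(),
        "goal": goal,        # from your own goal taxonomy
        "outcome": outcome,  # e.g. "success", "partial", "abandoned"
        "notes": notes,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

def outcome_counts() -> dict:
    """The simplest periodic summary: sessions grouped by outcome."""
    counts: dict = {}
    for line in LOG.read_text().splitlines():
        outcome = json.loads(line)["outcome"]
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts
```

Even this crude log answers questions that token counts cannot, such as which goals your team actually brings to the agent and how often those attempts are abandoned.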

The /insights command is a small feature in a large tool. But its architecture encodes a principle that most organizations have not yet internalized: measuring AI-assisted work requires qualitative analysis, not just quantitative metrics. Token counts tell you about cost. Understanding how your team uses AI tools tells you about value.


Source: Rob Zolkos. “Deep dive: Claude Code /insights command.” zolkos.com, February 4, 2026. Anthropic Claude Code documentation.

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation