What Hundreds of Skills Taught Anthropic About Governing AI Agents
We wrote about the governance theory of skills three weeks ago. Modular capabilities, audit trails, progressive context loading. The structural argument for why skills make agents governable.
Theory is necessary. It is not sufficient.
Thariq, an engineer at Anthropic, recently published an insider account of how Anthropic actually uses skills in production. Not the specification. Not the marketing pitch. The operational reality of hundreds of skills running across teams, with all the friction and failure modes that implies.
The lessons are worth dissecting. Not because Anthropic has all the answers, but because they have something rarer: operational data from a team that eats its own cooking at a scale most organizations have not reached.
Skills Are Not Prompts. They Are Environments.
The most common misunderstanding about skills is that they are instruction files. A SKILL.md with some text, loaded into context. Thariq’s account corrects this.
The skills that work at Anthropic are folder structures. Scripts that execute. Configuration files that set boundaries. Hooks that fire on specific events. Reference documents the agent discovers as needed. The SKILL.md file is the entry point, not the whole thing.
This matters because it changes the mental model. A prompt is static. An environment is interactive. The agent does not just read the skill and follow instructions. It navigates the skill’s file system, discovers context incrementally, and executes scripts that produce deterministic outputs.
As we explored in Context Engineering for AI Agents, effective agent systems treat context as an architecture problem, not a content problem. Thariq’s account provides the operational proof. The file system itself becomes a progressive disclosure mechanism. The agent loads what it needs when it needs it, rather than consuming everything upfront.
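To make the shape concrete, here is a hypothetical layout for such a skill. The folder and file names are illustrative, not taken from Anthropic's account:

```
deploy-checker/            # hypothetical skill folder
├── SKILL.md               # entry point: when to use, gotchas, pointers
├── config.yaml            # boundaries: protected paths, allowed targets
├── hooks/
│   └── pre_tool_use.sh    # fires only while this skill is active
├── scripts/
│   └── verify_prereqs.py  # deterministic check the agent executes
└── reference/
    └── rollout-process.md # loaded only if the agent needs it
```

The agent reads SKILL.md first and descends into scripts/ and reference/ only when a task requires them, which is exactly what makes the file system a progressive disclosure mechanism.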
The Taxonomy Nobody Asked For (But Everyone Needs)
Anthropic categorizes their skills into nine types: Library Reference, Product Verification, Data Fetching, Business Process, Code Scaffolding, Code Quality, CI/CD, Runbooks, and Infrastructure Ops.
The taxonomy is imperfect. Runbooks and Infrastructure Ops overlap. Code Quality and Product Verification share a fuzzy border. Thariq acknowledges that skills straddling categories tend to confuse agents. If a skill is hard to categorize, it is probably trying to do too much.
The value of the taxonomy is not the specific categories. It is the discipline of having one. Most teams accumulate skills organically. No naming convention. No categorization. No way to answer “how many verification skills do we have?” or “which skills touch production infrastructure?” Skill bloat is already a documented problem in the community, and it follows the same trajectory as microservice sprawl: easy to create, hard to govern, painful to deprecate.
A taxonomy forces choices. Is this a runbook or an infrastructure operation? If the answer is “both,” the skill needs to be split. That constraint produces cleaner boundaries, and cleaner boundaries produce more predictable agent behavior.
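The "exactly one category" rule is cheap to enforce mechanically. A minimal sketch, assuming each skill declares a category field in its metadata (the field name and the error wording are hypothetical):

```python
# Enforce the taxonomy: every skill declares exactly one of the nine
# categories from Anthropic's list; anything else fails review.
CATEGORIES = {
    "Library Reference", "Product Verification", "Data Fetching",
    "Business Process", "Code Scaffolding", "Code Quality",
    "CI/CD", "Runbooks", "Infrastructure Ops",
}

def validate_category(skill_name: str, category) -> list[str]:
    """Return a list of taxonomy violations for one skill (empty = OK)."""
    errors = []
    if isinstance(category, (list, tuple, set)):
        # A skill claiming two categories is trying to do too much: split it.
        errors.append(
            f"{skill_name}: declares {len(category)} categories, needs exactly 1"
        )
    elif category not in CATEGORIES:
        errors.append(f"{skill_name}: unknown category {category!r}")
    return errors
```

A skill declaring `["Runbooks", "Infrastructure Ops"]` fails validation, which is the check telling its author to split it.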
Verification Is the Bottleneck, Not Generation
Here is the insight with the largest practical implication. Thariq says it is “worth having an engineer spend a week” making verification skills excellent. A full week of engineering time on a single skill.
Why? Because in agentic systems, code generation is the easy part. Every model can produce plausible code. The hard part is knowing whether that code is correct, safe, and consistent with existing patterns.
We documented this pattern in The Evidence Is In: AI Coding Agents Are Breaking Things. Amazon’s thirteen-hour outage. Anthropic’s own UX bug that persisted for months. Stack Overflow’s survey where nearly half of developers said debugging AI code takes longer than writing it from scratch. In each case, generation outpaced verification.
Thariq’s recommendation is structural: invest disproportionately in verification skills. Build skills that check whether generated code matches architectural conventions. Build skills that run integration tests against specific subsystems. Build skills that verify deployment prerequisites before any code reaches production.
The asymmetry is important. A code scaffolding skill saves minutes. A verification skill prevents outages. The ROI calculation is not even close.
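What a verification skill bundles is often just a deterministic script. Here is a sketch of a deployment-prerequisite check; the variable names are hypothetical, chosen only to illustrate the pattern:

```python
import os

# Hypothetical prerequisites a deployment skill might verify before any
# code reaches production; the variable names are illustrative.
REQUIRED_ENV = ("DATABASE_URL", "RELEASE_TAG")

def check_prereqs(environ=os.environ) -> list[str]:
    """Return a list of problems; an empty list means prerequisites hold."""
    problems = []
    for var in REQUIRED_ENV:
        value = environ.get(var)
        if value is None:
            problems.append(f"{var} is not set")
        elif value == "":
            # Distinguish empty from unset: set-but-empty is the kind of
            # condition that makes deploy scripts fail silently.
            problems.append(f"{var} is set but empty")
    return problems
```

The value of a script like this is that its output is deterministic: the agent cannot hallucinate a passing check.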
The Gotchas Section Is the Actual Content
Thariq’s most counterintuitive recommendation: the highest-value content in a skill is the “Gotchas” section. Not the instructions. Not the examples. The warnings about what will go wrong.
This aligns with a finding from ETH Zurich. Researchers tested how instruction files (like AGENTS.md and CLAUDE.md) affect coding agent performance. Verbose instruction files degraded performance by approximately 3% and increased costs by 20%. More words produced worse results.
The implication is uncomfortable for teams that have been stuffing their skills with comprehensive documentation. The agent does not need a tutorial. It needs to know where the mines are buried.
“Don’t state the obvious” is Thariq’s phrasing. The model already knows how to write Python. It already knows how to structure a REST API. What it does not know is that your team’s CI pipeline breaks if you import module X before module Y, or that the staging database resets every four hours, or that the deployment script silently fails if the environment variable is set but empty.
Those are the Gotchas. They are the difference between a skill that works in a demo and a skill that works in production.
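Pulling those examples together, a Gotchas section in a SKILL.md might read like this. The specifics restate the hypothetical failure modes above (the variable name DEPLOY_ENV is invented for illustration):

```markdown
## Gotchas
- CI breaks if module X is imported before module Y. Always import Y first.
- The staging database resets every four hours. Never assume fixtures persist.
- The deploy script fails silently when DEPLOY_ENV is set but empty.
  Unset it entirely or give it a value; never leave it blank.
```

Four lines, no tutorial. Everything the model already knows has been stripped out.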
On-Demand Hooks: Modular Behavior Injection
One pattern from Thariq’s account deserves attention from teams building governed agent systems. Skills can bundle their own hooks, activated only when the skill is invoked.
This is significant because it enables modular behavior injection. A deployment skill can attach a PreToolUse hook that blocks file writes to protected directories. A data analysis skill can attach a hook that logs every database query. These hooks exist only while the skill runs. They do not pollute the agent’s baseline behavior.
The governance implication: skills become self-contained policy units. The skill defines not just what the agent can do, but what the agent must check before doing it. As we argued in Skills Are Not Replacing Agents. They Are Making Agents Governable, the structural separation of capabilities is what makes audit possible. On-demand hooks extend that separation to runtime behavior.
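As a sketch of what such a bundled hook could do, assuming Claude Code's documented PreToolUse contract (a JSON payload on stdin describing the pending tool call, with exit code 2 blocking it): the protected paths and payload field access below are illustrative, not taken from Anthropic's account.

```python
import json
import sys

# Hypothetical protected roots; a real skill would read these from its config.
PROTECTED = ("infra/production", "secrets")

def should_block(file_path: str, protected=PROTECTED) -> bool:
    """True when the pending write targets a protected directory."""
    return any(
        file_path == root or file_path.startswith(root + "/")
        for root in protected
    )

def main() -> int:
    payload = json.load(sys.stdin)  # hook receives the tool call as JSON
    target = payload.get("tool_input", {}).get("file_path", "")
    if should_block(target):
        print(f"blocked: {target} is protected", file=sys.stderr)
        return 2  # exit code 2 blocks the tool call
    return 0

# A real hook script would end with: sys.exit(main())
```

Because the hook is bundled with the skill, the policy travels with the capability: removing the skill removes the check, and auditing the skill audits both.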
Anthropic also uses PreToolUse hooks to measure skill usage. Which skills get invoked, how often, and by which teams. This is the telemetry layer that most skill implementations lack, and without it, governance is aspirational at best.
What Thariq Does Not Say
The honest analysis requires noting what is absent from the account.
Security gets no mention. In February 2026, Check Point Research disclosed CVEs (CVE-2025-59536, CVE-2026-21852) showing that malicious skills can execute arbitrary code on the host machine. Skills load scripts. Scripts run with the agent’s permissions. A compromised skill in a shared marketplace is a supply chain attack vector. Thariq’s account describes an internal team at a company with deep model expertise. External teams using community skills face a different threat model entirely.
Failure cases are invisible. The lessons come from skills that work. How many skills were abandoned? Which patterns seemed promising and failed? Survivorship bias is inherent in any “lessons learned” post, and acknowledging it does not eliminate it.
The “hundreds of skills” claim is contextless. What counts as active use? A skill invoked once a month by one engineer? A skill invoked hundreds of times daily across the organization? Without usage metrics, “hundreds in active use” tells us about breadth but nothing about depth.
Anthropic is not a typical organization. Their engineers have access to model internals, can debug agent behavior at a level external teams cannot, and have a feedback loop to the model development team that no customer will ever replicate. Lessons from this environment may not transfer cleanly to a financial services firm or a healthcare system.
The Marketplace Question
Thariq describes a governance pipeline for skills: sandbox first, build traction, then submit through a PR process. Enterprise admins can provision skills centrally. The progression mirrors how mature organizations handle internal tools.
But the marketplace model introduces a tension that the account does not resolve. Centralized provisioning implies trust. Trust in a skill’s security, reliability, and alignment with organizational policy. The Check Point disclosure showed that trust can be exploited. A skill that passes code review today can be updated tomorrow with a malicious hook.
The parallel to package management is exact and instructive. npm, PyPI, and Docker Hub all faced supply chain attacks despite review processes. Skills are smaller and simpler than packages, which reduces attack surface. They are also newer and less scrutinized, which increases risk. The equilibrium has not been found yet.
What Practitioners Should Take From This
Three operational patterns from Thariq’s account translate directly to teams building governed agent systems.
First, invest in verification skills disproportionately. If your team has ten skills and none of them verify output, you have built a generation pipeline without quality controls. The verification-to-generation ratio is the metric that predicts whether your agent system produces value or incidents.
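That ratio is trivial to compute once skills carry taxonomy categories. A sketch over a hypothetical inventory, treating Product Verification and Code Quality as the verifying categories (which categories count as "verification" is a judgment call, not something the account specifies):

```python
# Hypothetical inventory: skill name -> taxonomy category.
inventory = {
    "scaffold-api-route": "Code Scaffolding",
    "fetch-usage-metrics": "Data Fetching",
    "check-arch-conventions": "Product Verification",
    "run-payments-int-tests": "Code Quality",
}

VERIFYING = {"Product Verification", "Code Quality"}

def verification_ratio(inventory: dict) -> float:
    """Fraction of skills that verify output rather than generate it."""
    if not inventory:
        return 0.0
    verifying = sum(1 for cat in inventory.values() if cat in VERIFYING)
    return verifying / len(inventory)
```

For the inventory above the ratio is 0.5; a team at 0.0 has built exactly the unguarded generation pipeline described here.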
Second, write Gotchas, not tutorials. Strip your skills of everything the model already knows. What remains is your organization’s institutional knowledge: the undocumented constraints, the tribal knowledge, the failure modes that only surface in production. That is what belongs in a skill.
Third, build the taxonomy before the skills. Decide on categories. Enforce that every skill belongs to exactly one. When a skill wants to straddle two categories, split it. The discipline is unpleasant and necessary. As we documented in How to Build Self-Improving Coding Agents, systems that improve over time require structure. Skills without taxonomy become the agent equivalent of a shared drive full of untitled documents.
One additional lesson sits beneath the operational patterns. Anthropic’s account, for all its value, describes a team with unusual advantages. Deep model knowledge. Direct feedback loops to researchers. Internal tooling that external teams will never see. The patterns transfer. The context does not. Teams adopting these patterns need to account for the distance between their environment and Anthropic’s, and build the verification infrastructure to close it.
This analysis synthesizes Thariq’s “Lessons from Building Claude Code: How We Use Skills” (March 2026), the ETH Zurich study on AGENTS.md effectiveness (February 2026), and Check Point Research’s Claude Code CVE disclosure (February 2026).
Victorino Group helps organizations build AI agent systems with governed skill architectures. Let’s talk.