The Spec Layer: Five Independent Teams Discovered the Same Agent Governance Architecture in One Week
In March 2026, we documented how three independent governance frameworks (Singapore’s IMDA, the Cloud Security Alliance, and the Auton researchers) converged on a single architectural conclusion: agent specifications are governance artifacts, not documentation. That convergence was top-down. Regulators and standards bodies, working independently, arrived at the same answer.
Within one week at the end of March 2026, five practitioners published independent work that validates the same thesis from the opposite direction. Matt Rickard coined “Spec-Driven Development.” GitHub’s Copilot Applied Science team treated agents like junior engineers governed by process. Google shipped Skills as spec enforcement for Gemini agents. Anthropic’s Claude Code architecture revealed a 512,000-line harness where CLAUDE.md injection is the governance layer. Kent Beck applied TCR (Test && Commit || Revert) to constrain agent execution with automatic rollback.
None of them cited each other. All of them landed on specifications as the primary control surface for agent behavior.
When regulators and practitioners, working independently, solve the same problem the same way, the solution is structural. This is not a trend. It is a discovery.
The Problem All Five Are Solving
Matt Rickard articulated it most clearly: “An AI agent implements a feature. The code compiles. The tests pass. It still misses the point.”
This is the defining failure mode of agentic systems. The agent produces work that is locally valid but globally wrong. It disables failing tests instead of fixing the underlying issue. It reuses the nearest pattern instead of the correct one. It adds code beside old paths instead of replacing them. Every individual decision looks reasonable in isolation. The aggregate result misses the intent.
As we explored in the configuration-dependent safety analysis, the distance between correct behavior and incorrect behavior is often a configuration change. The model is the same. The capability is the same. What changes is the constraint surface around it.
All five teams identified the same root cause: the interface between human intent and machine execution is too wide. The agent has too many degrees of freedom. The solution, in every case, is to narrow that interface through declarative specifications.
Five Architectures, One Pattern
1. Rickard: Spec-Driven Development
Rickard’s framework is the most explicit about the architectural pattern. He calls it Spec-Driven Development (SDD): write durable intent before implementation. The spec is not a plan. It is not a prompt. It is a layered, declarative document that constrains execution.
He draws the lineage deliberately. RFC 791 (IP), RFC 9110 (HTTP), RFC 8446 (TLS), the HTML Living Standard. These are specifications that constrain implementation without prescribing it. One TCP/IP implementation can differ wildly from another, yet both must conform to the spec. SDD applies the same principle to AI agents.
The key insight: “The winning model puts a narrow interface between human intent and machine execution.” The spec is that interface. It is not the prompt (too ephemeral). It is not the model weights (too opaque). It is not the application code (too detailed). It is a declarative layer that sits between what you want and what the agent does.
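The shape of that declarative layer can be sketched in a few lines. This is a hypothetical format for illustration, not GitHub Spec Kit's or any vendor's actual schema: the spec declares predicates the output must satisfy, not steps the agent must follow.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Spec:
    """A declarative constraint surface: what the result must satisfy."""
    intent: str  # durable statement of what we want
    must: list[Callable[[str], bool]] = field(default_factory=list)
    must_not: list[Callable[[str], bool]] = field(default_factory=list)

    def check(self, output: str) -> bool:
        return (all(p(output) for p in self.must)
                and not any(p(output) for p in self.must_not))

spec = Spec(
    intent="Replace the legacy auth path; do not add a parallel one",
    must=[lambda out: "new_auth" in out],
    must_not=[lambda out: "legacy_auth" in out],
)

print(spec.check("call new_auth()"))             # True: conforms to intent
print(spec.check("new_auth(); legacy_auth()"))   # False: adds beside the old path
```

The second check is the "locally valid, globally wrong" failure mode made mechanical: code that compiles and runs, but keeps the old path the spec said to remove.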
Rickard names five companies building in this space: GitHub Spec Kit, Kiro, OpenSpec, Tessl, and Intent. The commercial ecosystem is already forming around this pattern.
2. GitHub: Process as Governance
Tyler McGoffin’s report from GitHub’s Copilot Applied Science team reads like a field study in specification-driven governance, even though it never uses that term.
The team created 11 agents and 4 new skills in a sprint lasting fewer than 3 days. The agents touched 345 files, adding 28,858 lines and removing 2,884. The velocity was extraordinary. But the governance story is more interesting than the speed story.
McGoffin’s core principle: “Blame process, not agents.” When an agent produces wrong output, the failure is not in the agent. The failure is in the specification that constrained it. Strict typing, linting, contract tests, and integration tests enforce architectural compliance. Documentation and architectural standards become the primary governance mechanism.
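A contract test is the concrete form of "blame process, not agents." A minimal sketch, with illustrative names rather than GitHub's actual code: the agent may rewrite the handler however it likes, but CI rejects any version that breaks the contract.

```python
# Hypothetical contract test: the handler body is agent-written and may change
# on every run; the contract below is human-written and does not.
def handle(request: dict) -> dict:
    # Imagine this body was generated, and regenerated, by an agent.
    return {"status": 200, "body": str(request.get("q", ""))}

def test_contract():
    resp = handle({"q": "hello"})
    assert set(resp) == {"status", "body"}    # response shape is fixed
    assert isinstance(resp["status"], int)    # types are part of the contract
    assert handle({})["status"] == 200        # missing fields must not crash

test_contract()
print("contract holds")
```

When a test like this fails, the remedy is not to scold the agent. It is to tighten the contract until wrong implementations cannot pass.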
This is the same conclusion as Rickard, arrived at through practice rather than theory. The GitHub team did not set out to build a governance architecture. They set out to ship software fast with agents. They discovered that the only way to ship fast is to constrain thoroughly. Speed and governance turned out to be the same thing.
McGoffin’s observation lands with precision: “I may have just automated myself into a completely different job.” The role shifts from writing code to writing specifications. From implementation to governance.
3. Google: Skills as Spec Enforcement
Google’s announcement of Gemini API Docs MCP and Developer Skills looks like a developer tooling update. Look closer, and it is a governance architecture.
Skills are structured instructions that connect agents to current documentation and enforce best-practice patterns. Combined with MCP (Model Context Protocol) for documentation access, the system achieves a 96.3% pass rate for correct behavior, using 63% fewer tokens per correct answer than vanilla prompting.
That 63% reduction matters. It means the governance layer is not adding overhead. It is reducing it. The agent does less unnecessary work when it has clearer constraints. This mirrors what we documented in context engineering for AI agents: smaller, more structured context outperforms larger, more comprehensive context.
The governance angle is, in effect, governance-as-configuration. Compliance with current standards is baked into the agent’s tooling rather than relying on the agent’s training data. The agent does not need to “know” the right approach. The skill tells it. The spec, loaded at runtime, overrides whatever the model learned during training.
This is the same architectural decision as Rickard’s SDD. The spec sits between intent and execution. But Google’s implementation reveals something additional: the spec is not static. It is a live connection to current documentation. When the documentation changes, the governance changes. No retraining. No redeployment. Configuration.
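The live-spec mechanism reduces to a small pattern. This is an illustrative sketch, not Google's Skills implementation: guidance lives in a file refreshed independently of the model, and it is injected ahead of the task so runtime instructions win over training-time habits.

```python
from pathlib import Path
import tempfile

def build_context(task: str, skill_path: Path) -> str:
    """Prepend the current skill text to the task. Editing the skill file
    changes governance with no retraining and no redeployment."""
    skill = skill_path.read_text() if skill_path.exists() else ""
    return f"{skill}\n---\n{task}"

# Demo with a hypothetical skill file:
with tempfile.TemporaryDirectory() as d:
    skill = Path(d) / "gemini-api.skill.md"
    skill.write_text("Use the v2 streaming endpoint, never the deprecated v1.")
    ctx = build_context("Write a chat client.", skill)
    print("v2" in ctx)   # True: current guidance is in the context
```

The design choice worth noting: the spec is read at call time, so a documentation update propagates to the very next agent session.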
4. Anthropic: The Harness Is the Governance
Sebastian Raschka’s analysis of Claude Code’s architecture (based on leaked source code) revealed a system where approximately 512,000 lines of TypeScript surround roughly 200 lines of actual API calls.
That ratio (2,560:1) is the architecture speaking. The model is a component. The harness is the system. And the harness is where governance lives.
The key mechanism is CLAUDE.md injection: up to 40,000 characters of codebase-specific guidance loaded into the agent’s context at runtime. File-read deduplication prevents the agent from re-reading files it has already seen. Subagent parallelization splits complex tasks into governed subtasks. Context resets prevent accumulated context from degrading behavior. File-based communication between agents creates an auditable record.
This is spec-driven governance implemented at the infrastructure level. The CLAUDE.md file is a specification. It declares what the agent should do, how it should do it, what it should avoid, and what standards it must meet. The harness enforces that specification through architectural constraints, not through hope.
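Two of the harness mechanisms described above can be sketched concretely. This is an illustrative reconstruction from Raschka's description, not Anthropic's actual code: a hard cap on injected CLAUDE.md text, and deduplication of file reads.

```python
from typing import Optional

MAX_SPEC_CHARS = 40_000  # the injection cap reported for CLAUDE.md

class Harness:
    """Sketch of a governing harness around a model (hypothetical)."""
    def __init__(self, claude_md: str):
        self.spec = claude_md[:MAX_SPEC_CHARS]  # governance text, hard-capped
        self.seen: set = set()

    def read_file(self, path: str, contents: str) -> Optional[str]:
        if path in self.seen:       # dedup: never re-inject an already-seen file
            return None
        self.seen.add(path)
        return contents

    def context(self, task: str) -> str:
        # The spec is injected ahead of the task on every session.
        return f"{self.spec}\n\n{task}"

h = Harness("Always run the linter before committing.")
print(h.read_file("a.py", "x = 1"))   # first read returns the contents
print(h.read_file("a.py", "x = 1"))   # repeat read is suppressed
```

Both mechanisms are architectural constraints, not suggestions: the agent cannot exceed the spec budget or bloat its context with repeated reads, regardless of what it "wants" to do.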
The “sprint contract” pattern between generator and evaluator agents is particularly revealing. One agent generates. Another evaluates. The contract between them is a specification. Governance is not a separate concern bolted onto the system. It is the architecture of the system itself.
5. Beck: TCR as Hard Governance
Kent Beck’s application of TCR (Test && Commit || Revert) to AI agents is the most radical governance mechanism of the five. It is also the simplest.
The rule: if the tests pass, commit the change. If they fail, revert to the last known good state. No negotiation. No retry. No “let me fix it.” Automatic revert.
Applied to AI agents, this creates a hard constraint that no amount of prompt engineering can override. The agent literally cannot persist code that fails tests. The specification (expressed as tests) is not advisory. It is a physical boundary. Cross it, and the work disappears.
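The rule fits in a single function. A minimal sketch, shown here in Python rather than Beck's own tooling; the command lists are illustrative, and in practice this wires to your real test runner and version control.

```python
import subprocess

def tcr(test_cmd: list, commit_cmd: list, revert_cmd: list) -> str:
    """Run the tests; commit on green, hard-revert on red. No third outcome."""
    passed = subprocess.run(test_cmd).returncode == 0
    subprocess.run(commit_cmd if passed else revert_cmd, check=True)
    return "committed" if passed else "reverted"

# Typical wiring (commands for illustration):
# tcr(["pytest", "-q"],
#     ["git", "commit", "-am", "tcr"],
#     ["git", "reset", "--hard"])
```

The agent's output never reaches the decision logic. Only the test runner's exit code does, which is why no prompt can talk its way past the boundary.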
Beck blends TCR discipline with AI Skills to produce what he calls constrained agent execution. The agent is free to try anything. But only work that passes the specification survives. Evolution by selection, applied to code generation.
This is governance reduced to its purest form: the specification defines what survives, and everything else is automatically eliminated. No human review needed. No approval workflow. The spec enforces itself.
The Convergence Pattern
Strip away the implementation differences and a shared architecture emerges across all five.
The spec is the primary control surface. Not the model. Not the code. Not the prompt. Rickard makes it explicit. GitHub discovered it through practice. Google implemented it as Skills. Anthropic built it as a harness. Beck enforced it through automatic reversion.
The spec narrows the interface between intent and execution. Every team found that agents with fewer, clearer constraints outperform agents with broad, vague instructions. Google measured it (63% fewer tokens). GitHub measured it (345 files in 3 days). Beck measured it (only passing code survives). The constraint is not a cost. It is the mechanism that enables velocity.
The spec is declarative, not procedural. None of these systems tell the agent how to do things step by step. They declare what the result must satisfy. The agent retains autonomy within the constraint boundary. This is the same architectural principle as the regulatory frameworks we analyzed in the governance artifacts essay: declare boundaries, not procedures.
The spec sits at runtime, not training time. Google’s Skills update when documentation changes. Anthropic’s CLAUDE.md loads fresh on every session. Rickard’s SDD layers are versioned artifacts that evolve independently of the model. Beck’s tests can change between runs. Governance is a runtime concern, not a training concern. This is consequential. It means governance can adapt faster than models can be retrained.
Why Bottom-Up Validation Matters
The March 2026 regulatory convergence we documented was significant because three independent governance bodies arrived at the same conclusion. But regulatory convergence alone could be dismissed as theoretical alignment. Regulators might agree on a framework that practitioners find impractical.
The practitioner convergence documented here eliminates that objection. These five teams are building production systems. GitHub’s agents shipped code. Google’s Skills are in their public API. Anthropic’s harness runs every Claude Code session. Beck’s TCR pattern has been tested with real agents. Rickard is naming companies that are already commercializing the pattern.
The convergence is now both top-down (regulators) and bottom-up (practitioners). Both groups, working independently, arrived at specifications as the governance mechanism for AI agents. The distance between “good idea in a framework” and “working pattern in production” has closed.
What This Means for Organizations
Three implications for teams deploying AI agents today.
First, invest in specification authorship as a core competency. McGoffin’s observation (“I may have just automated myself into a completely different job”) is the signal. The bottleneck in agent-driven development is not coding. It is specifying. Organizations that treat specs as overhead will fall behind organizations that treat specs as the product.
Second, build your governance layer at runtime, not training time. All five architectures place the specification between the model and the execution environment, loaded fresh for each task. This means governance can iterate at the speed of configuration changes rather than the speed of model retraining. If your governance depends on “the model knowing the right thing to do,” you are building on sand.
Third, test your specifications with the same rigor you test your code. Beck’s TCR pattern is the extreme version, but every team here validates agent output against specifications automatically. The spec is only as good as its enforcement mechanism. A spec in a wiki that no one reads is not governance. A spec in CI that blocks deployment is.
The spec layer is forming. The question is no longer whether specifications will govern AI agents. Five independent teams, in one week, showed that they already do. The question is whether your organization will design its spec layer deliberately or discover it accidentally during an audit.
This analysis synthesizes The Spec Layer by Matt Rickard (April 2026), Agent-Driven Development from GitHub (March 2026), Gemini API Skills from Google (April 2026), Claude Code’s leaked source analysis (March 2026), and Kent Beck’s TCR Skill sessions (April 2026).
Victorino Group helps organizations build specification-driven governance for AI agent fleets. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.