The Most Governed Software Factory You've Never Heard Of
StrongDM’s Software Factory operates under two rules. Rule one: no human writes code. Rule two: no human reviews code.
The surface reaction is predictable. Reckless. Irresponsible. A security company --- of all companies --- removing humans from the software development loop. Justin McCarthy, StrongDM’s CTO, founded the Factory in July 2025 with a three-person team. By February 2026, they had shipped production software built entirely by AI agents, validated by AI agents, and maintained by AI agents. The manifesto at factory.strongdm.ai reads like a provocation.
It is not.
Look past the rhetoric and something unexpected emerges. The team that removed humans from coding built one of the most disciplined governance frameworks in the industry. Every technique they describe --- scenarios, digital twins, weather reports, context databases --- is a governance mechanism wearing engineering clothes. The Factory is not radical because it removed human oversight. It is radical because it replaced human oversight with something more rigorous: machine-enforceable policy at every layer of the development process.
The lesson is not that you should remove humans from code. The lesson is that when governance is encoded deeply enough into the system, ad-hoc human gatekeeping becomes redundant.
Scenarios Are Policy, Not Tests
The Factory’s core validation technique revolves around what they call scenarios. The word sounds casual. The architecture is not.
A scenario is an end-to-end behavioral specification, stored outside the codebase. Not a unit test. Not an integration test. A complete description of what the software should do, expressed in terms a non-engineer can read and an LLM can evaluate. The process flows in a tight loop: seed the work, run a validation harness, feed results back, iterate.
The critical design decision is storing scenarios outside the codebase they validate. In traditional testing, tests live alongside code in the same repository. Developers who write code also write and modify tests. The auditor and the auditee share the same environment. The Factory breaks this coupling. Scenarios exist independently, serving as external validation criteria that the code-generating agents cannot influence.
This is the separation of concerns that governance frameworks have demanded for decades, implemented as engineering architecture.
There is a direct lineage here. In our previous analysis of Nicholas Carlini’s 100,000-line compiler project --- where 16 AI agents built a C compiler by targeting an existing compiler as an oracle --- the core insight was identical. Carlini’s agents did not read a specification document. They targeted the observable behavior of GCC. The test suite was the specification. StrongDM’s scenarios operate on the same principle, but generalized: define expected behavior externally, let agents converge toward it, measure satisfaction probabilistically.
That last part deserves emphasis. The Factory moved from binary pass/fail to probabilistic satisfaction. Rather than asking “does this test pass?” they ask a different question: across all observed trajectories through all scenarios, what fraction likely satisfy the user? This is how you evaluate software that has agentic components --- where the output is not deterministic and success is a distribution, not a boolean.
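The shift from boolean pass/fail to probabilistic satisfaction can be sketched in a few lines. This is an illustrative model, not StrongDM's implementation: the trajectory structure, the LLM-judge verdict field, and the 95% threshold are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    """One observed run of the software through a scenario (hypothetical shape)."""
    scenario: str
    satisfied: bool  # verdict from an LLM judge, not a deterministic assertion

def satisfaction_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of observed trajectories judged to satisfy the user."""
    if not trajectories:
        return 0.0
    return sum(t.satisfied for t in trajectories) / len(trajectories)

def release_gate(trajectories: list[Trajectory], threshold: float = 0.95) -> bool:
    """Ship when the satisfaction distribution clears the bar,
    not when every individual run passes."""
    return satisfaction_rate(trajectories) >= threshold
```

The key property: one failed run out of twenty does not block a release, because success is a distribution, not a boolean.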
The Factory treats code the way machine learning treats model weights: opaque artifacts validated entirely by behavior. You do not review the weights. You measure what the model does. StrongDM applies this philosophy to software: you do not review the code. You measure what the software does across hundreds of scenarios.
This is a governance philosophy, not just an engineering technique. It shifts the question from “is the code correct?” to “does the system behave as intended?” --- and answers it with statistical rigor rather than human judgment.
The Digital Twin as Controlled Environment
The most ambitious piece of the Factory’s infrastructure is the Digital Twin Universe, or DTU. The team built behavioral clones of the third-party services their software depends on: Okta, Jira, Slack, Google Docs, Google Drive, Google Sheets. These are not mocks or stubs. They are functional replicas that mirror the APIs, edge cases, and observable behaviors of the real services.
From an engineering perspective, this is an advanced staging environment. From a governance perspective, it is something more important: a controlled testing environment that operates at volumes exceeding production, without exposing real customer data.
Staging environments are not new. Every mature engineering organization has one. But the DTU pushes the concept further in two important ways.
First, it eliminates the constraints that make traditional integration testing incomplete. Rate limits from third-party APIs, abuse detection triggers, accumulating API costs --- these practical barriers typically mean integration tests run infrequently or cover only critical paths. The DTU removes all of them. Agents can run thousands of scenarios per hour against the digital twins without hitting any external constraint. This is testing density that production-connected environments cannot achieve.
Second, it addresses a problem specific to agentic software. When your product includes AI agents that interact with external services --- and StrongDM’s does --- you need to test the agent’s behavior in realistic conditions without the agent actually modifying real customer data in Jira or Slack. The DTU creates a sandbox where agentic workflows execute against realistic service replicas. The agent does not know the difference. The customer’s data is never at risk.
This is governance through environment design. You do not need a policy document that says “never test with production data.” You build an environment where production data is architecturally inaccessible. The constraint is structural, not procedural.
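The "architecturally inaccessible" idea can be illustrated with a resolver that simply has no path to production. The endpoint addresses and service names below are invented for illustration; the DTU's actual wiring is not public.

```python
# Hypothetical twin endpoints; the real DTU addresses are assumptions.
TWIN_ENDPOINTS = {
    "okta": "https://dtu.internal/okta",
    "jira": "https://dtu.internal/jira",
    "slack": "https://dtu.internal/slack",
}

def resolve_endpoint(service: str) -> str:
    """Agents obtain service URLs only through this resolver, and the
    resolver only knows twins. There is no flag to flip toward production:
    real endpoints are absent from the system, not merely forbidden by policy."""
    if service not in TWIN_ENDPOINTS:
        raise LookupError(f"no digital twin for {service!r}; call refused")
    return TWIN_ENDPOINTS[service]
```

The constraint lives in the data, not in a review checklist: an agent cannot reach a production API whose address was never present.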
Weather Report: Operational Oversight as Routine
The Factory publishes what it calls a Weather Report. Updated regularly, it describes which AI models handle which types of work, how they perform, and where routing decisions have changed.
As of February 2026, the routing looks like this: gpt-5.3-codex handles high-complexity tasks like computer science problems, frontend architecture, and security review. Opus 4.6 handles DevOps, QA orchestration, and frontend aesthetics. Architectural critique goes to gpt-5.2 with elevated parameters. Sprint planning uses a consensus merge that combines independent analyses from multiple models.
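The routing above reads naturally as a service catalog. A hypothetical encoding (model names come from the Weather Report; the task-type keys are assumptions, and the multi-model consensus case for sprint planning is omitted for brevity):

```python
# Hypothetical service-catalog encoding of the published routing.
ROUTING = {
    "cs-problems":            "gpt-5.3-codex",
    "frontend-architecture":  "gpt-5.3-codex",
    "security-review":        "gpt-5.3-codex",
    "devops":                 "opus-4.6",
    "qa-orchestration":       "opus-4.6",
    "frontend-aesthetics":    "opus-4.6",
    "architectural-critique": "gpt-5.2",  # run with elevated parameters
}

def route(task_type: str) -> str:
    """Explicit, auditable model selection: unknown work is rejected, not guessed."""
    if task_type not in ROUTING:
        raise KeyError(f"no published routing for {task_type!r}")
    return ROUTING[task_type]
```

Because the table is a single shared artifact, "which model handles security review?" has exactly one answer, and changing it is a visible diff rather than an individual engineer's quiet choice.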
This is not a benchmark publication. It is an operational service catalog.
Every mature engineering organization maintains a catalog of its infrastructure: which services run where, what their SLAs are, when they were last updated. The Weather Report applies this discipline to AI model operations. It makes model selection decisions explicit, auditable, and shared. Anyone on the team --- or, because it is published, anyone in the industry --- can see which model handles which workload and why.
This transparency is the direct antidote to shadow AI. Gartner predicts that 40% of firms will face shadow AI security incidents. Shadow AI thrives when model usage is invisible --- when individual engineers or teams choose models ad hoc, without centralized knowledge of what is running where. The Weather Report eliminates this by making model operations a public, routinely updated artifact.
The format matters as much as the content. By treating model routing as something that deserves regular publication --- like an SRE team’s status page --- the Factory normalizes operational oversight of AI systems. It transforms “which model are we using?” from an ad hoc question into an institutional practice.
The $1,000 Metric
McCarthy offers a benchmark for software factory maturity: if you have not spent at least $1,000 on tokens per human engineer per day, your software factory has room for improvement.
The number is provocative. At roughly $20,000 per month per developer in token costs, it exceeds what most organizations spend on their entire cloud infrastructure per developer. The instinct is to dismiss it as extravagance.
Reframe it as an operational KPI instead.
Every mature cloud organization tracks spend efficiency metrics. Cost per request. Cost per transaction. Cost per active user. These metrics exist not because spending money is good, but because the ratio between spend and output reveals operational maturity. A team spending $5,000 per month on cloud and serving ten million users has a different operational posture than a team spending $5,000 and serving one thousand users.
The $1,000 benchmark works the same way. It is not a prescription to burn money. It is an observation that when AI agents handle the majority of development work, token consumption is a leading indicator of agent utilization. Low spend means agents are idle or underused. The metric forces a question: are you actually operating a software factory, or are you doing manual development with occasional AI assistance?
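The arithmetic behind the benchmark is simple; the 20-working-day month used to reach the roughly $20,000/month figure is an assumption.

```python
DAILY_BENCHMARK_USD = 1_000   # McCarthy's floor, per human engineer per day
WORKING_DAYS_PER_MONTH = 20   # assumption behind the ~$20,000/month figure

def spend_per_engineer_day(daily_token_spend_usd: float, engineers: int) -> float:
    """The KPI: total daily token spend divided across human engineering headcount."""
    return daily_token_spend_usd / engineers

def meets_benchmark(daily_token_spend_usd: float, engineers: int) -> bool:
    """Low spend signals idle or underused agents, not thrift."""
    return spend_per_engineer_day(daily_token_spend_usd, engineers) >= DAILY_BENCHMARK_USD

monthly_per_engineer = DAILY_BENCHMARK_USD * WORKING_DAYS_PER_MONTH  # 20_000
```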
The contrast with most organizations is stark. Kiteworks reported in 2026 that 63% of organizations cannot enforce AI purpose limitations, and 60% cannot terminate misbehaving agents. These organizations have no operational metrics for their AI systems at all. They cannot measure what they cannot see. McCarthy’s $1,000 benchmark --- whatever you think of the specific number --- represents the opposite end of the maturity spectrum: operational metrics so precise that daily token spend per engineer is a tracked KPI.
The gap between “we cannot terminate a misbehaving agent” and “we track daily token spend per engineer” is the gap between organizations that use AI and organizations that operate AI. The Factory is firmly in the latter category.
What This Means for Your Operations
Step back from the specifics and a pattern emerges. Every component of the Factory maps to a recognized governance discipline:
Scenarios are policy enforcement. Behavioral specifications, stored independently from the code they govern, evaluated probabilistically. This is the same separation of duties that financial audit frameworks require --- the entity being audited does not control the audit criteria.
DTU is controlled testing. An environment designed so that dangerous actions are architecturally impossible, not just procedurally prohibited. This is defense in depth applied to the development process itself.
Weather Report is operational oversight. A published, routinely updated catalog of what AI systems are running, how they are configured, and how they perform. This is the SRE discipline applied to AI operations.
CXDB is auditable context. An immutable, content-addressed database that stores conversation histories and tool outputs in a directed acyclic graph. Every agent interaction is recorded, deduped, and retrievable. This is the audit trail.
Attractor is the orchestration layer. Published as natural language specifications --- three markdown files on GitHub --- that define how agents compose into pipelines. The orchestration logic is inspectable and reproducible. This is process documentation as executable architecture.
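The CXDB properties listed above (immutable, content-addressed, deduplicated, DAG-structured) map onto a small sketch. This is an illustrative model of content addressing, not CXDB's actual schema or API.

```python
import hashlib
import json

def put(store: dict[str, str], payload: dict, parents: tuple[str, ...] = ()) -> str:
    """Content-addressed insert: the key is a hash of the record plus its
    parent keys, which yields an immutable DAG and free deduplication.
    Storing the same interaction twice produces the same key and writes
    nothing new."""
    record = json.dumps({"payload": payload, "parents": list(parents)}, sort_keys=True)
    key = hashlib.sha256(record.encode()).hexdigest()
    store.setdefault(key, record)  # never overwrites: existing keys are immutable
    return key

# A two-node lineage: a tool output whose parent is the prompt that caused it.
store: dict[str, str] = {}
prompt = put(store, {"role": "user", "text": "list open incidents"})
tool = put(store, {"tool": "jira.search", "result": "[]"}, parents=(prompt,))
```

Because the key is derived from the content, the audit trail is tamper-evident by construction: altering a record changes its hash and breaks every edge that pointed at it.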
The lesson is clear, and it applies whether or not you agree with the “no human code” philosophy: governance that depends on human attention at the point of execution does not work at agent speed.
When agents generate code in minutes, a human review gate becomes a bottleneck that either slows everything down or gets rubber-stamped into irrelevance. When agents run thousands of scenarios per hour, a human cannot meaningfully oversee each one. The Factory’s insight is that you need governance mechanisms that operate at the same speed as the agents they govern. Scenarios validate faster than agents code. The DTU tests at volumes humans could never review. The Weather Report updates routinely without requiring someone to ask.
You do not need to adopt the “no human code” rule to benefit from this architecture. Most organizations will not --- and should not --- remove human review entirely. But every organization deploying AI agents at scale needs to grapple with the same question: how do you maintain governance when the speed of execution exceeds the speed of human oversight?
The answer, demonstrated by the Factory, is to encode governance into the system’s architecture rather than its personnel. Make policy machine-enforceable. Make testing environments structurally safe. Make operational oversight automatic and published. Make context auditable by default.
The most AI-radical company in the industry built the most disciplined governance framework. That is not a contradiction. It is a prerequisite.
Sources
- StrongDM Factory. “Principles,” “Techniques,” “Products,” and “Weather Report.” factory.strongdm.ai, 2025-2026.
- Nicholas Carlini. “Building a C compiler with Claude.” Anthropic Research Blog, February 2026.
- Gartner. “Predicts 2026: 40% of firms will face shadow AI security incidents.” 2025.
- Kiteworks. “2026 Forecast: 63% of organizations cannot enforce AI purpose limitations.” 2026.
Victorino Group helps organizations build the operational governance layer that makes autonomous AI sustainable. If you’re designing AI operations or need help implementing oversight mechanisms that work at machine speed, reach out.
If this resonates, let's talk
We help companies implement AI without losing control.
Schedule a Conversation