The Software Factory Needs a Scoreboard

Qudrat Ullah’s freeCodeCamp piece “How to Build a Software Factory with Claude Code” is the cleanest practitioner-grade playbook published this year for how one developer plus Claude Code becomes a coordinated team. Seven specialized subagents. Three human approval gates. Hooks as deterministic enforcement. CLAUDE.md as the contract layer. Two to three hours to stand up the first version. Most posts about “agentic coding” talk in metaphors; Ullah ships a buildable specification.

We agree with almost every primitive. The architecture is sound. What follows is not a critique. It is the natural next move on top of his chain: the measurement axis the playbook leaves open.

Two Factories, One Word

The word “factory” now points at two different objects, and both are real.

There is the org-level factory: StrongDM’s manifesto that no human writes code and no human reviews code, with governance disciplines (scenarios, DTU, Weather Report, CXDB) embedded as architecture. That is a company-shaped factory, and we have written about how its disciplines work as policy-in-architecture.

Ullah’s piece describes the other factory: the repo-level one. One developer, one machine, Claude Code as the workforce, a CLAUDE.md file as the constitution. Both are governance objects. They are not in tension. The org-level factory tells you how a whole engineering organization makes the chain accountable. The repo-level factory tells you how an individual contributor turns that same idea into a working harness inside a single codebase.

The rest of this piece is about the repo-level one, because that is what Ullah specified, and that is what most teams will adopt first.

What Ullah Got Exactly Right

A short inventory of the primitives we endorse without modification:

Seven roles with scoped tool access. A codebase-researcher that only reads. A story-writer and spec-writer that produce artifacts. A backend-builder, frontend-builder, test-verifier, and implementation-validator that do the work and check the work. Each subagent has a job description and a tool allowlist. The roles are different because the tools are different, which is the only honest reason to split agents.

Three human approval gates. After the story. After the spec. Before the PR. Ullah is explicit that these are non-negotiable. The mantra he closes on, “Cloud AI is not a teammate. Accountability stays with the human,” is the right frame. Approval gates are not friction; they are where authorship lives.

Hooks as deterministic enforcement. Ullah’s hook recommendations align with the deterministic-shell thesis we covered when Dabit and Wolfe published in the same week. The six lifecycle events are the right primitives. A SessionStart hook that loads project context. A PreToolUse hook that blocks edits to forbidden paths. A Stop hook that refuses completion until tests pass. Code, not persuasion.
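
To make "code, not persuasion" concrete, here is a minimal sketch of the PreToolUse case. It is not Ullah's implementation. It assumes the hook receives a JSON payload on stdin with `tool_name` and `tool_input.file_path` fields, and that exiting with status 2 blocks the tool call; verify both against the current Claude Code hooks documentation. The directory list and the set of edit tool names are hypothetical placeholders.

```python
import json
import sys
from pathlib import PurePosixPath

# Hypothetical values for illustration; substitute your own.
FORBIDDEN_DIRS = {"migrations", ".github"}
EDIT_TOOLS = {"Edit", "Write", "MultiEdit"}


def is_forbidden(file_path: str) -> bool:
    """True if any directory component of the path is on the forbidden list."""
    parts = PurePosixPath(file_path.replace("\\", "/")).parts
    return any(part in FORBIDDEN_DIRS for part in parts[:-1])


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one hook payload."""
    tool = payload.get("tool_name", "")
    path = payload.get("tool_input", {}).get("file_path", "")
    if tool in EDIT_TOOLS and path and is_forbidden(path):
        # Assumed convention: exit status 2 means "block this tool call".
        return 2, f"Blocked: {path} is under a protected directory."
    return 0, ""


def main() -> int:
    # Assumed: the hook payload arrives as JSON on stdin.
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code

# When saved as a hook script, finish the file with: sys.exit(main())
```

The decision lives in a pure function so it can be unit-tested without a running agent. A read of a protected path passes through, because only the edit tools are checked.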

“Use AI to automate structured work, not chaotic work.” The single best one-line discipline mantra of the year. It rules out the entire class of “let the agent figure it out” prompts that produce the failure modes everyone complains about.

Restart the conversation. Ullah names context drift as a real failure mode and prescribes the simplest possible fix: throw the session away, start clean, let the artifacts carry the state. That is a practitioner’s instinct, and it is correct.

CLAUDE.md sized at 100 to 300 lines. Specialized agents, not god agents. Artifacts as the medium of state. Every one of these is a load-bearing decision, and Ullah names them all.

The Missing Axis: Is the Factory Improving?

Here is where building forward starts.

Ullah teaches you how to assemble the factory. He does not teach you how to know if it is improving. There is no chapter on instrumentation. After two to three hours of setup, you have a working chain. After two months of use, you have no idea whether the chain is getting better, drifting, or quietly burning your time.

This is the centaur measurement problem applied one layer down. We have argued before that the right unit of measurement in agentic software development is the pair, not the model: human plus AI as a single performance object. Ullah’s factory is a centaur. Seven subagents plus three approval gates plus one human accountable for the PR. To know if that centaur is getting faster, more reliable, more honest, you need numbers on top of the chain.

Without those numbers, the seven-agent factory is a more disciplined vibe coding loop, not yet a learning system. It is faster than freestyle prompting. It still drifts the same way freestyle prompting does, and you have no data to show otherwise.

Three Numbers to Instrument

The three axes below are not exhaustive. They are the minimum a factory operator should record in a CSV next to the repo, starting Monday.

1. Per-agent reliability

When the implementation-validator flags a critical defect, which builder produced it? When the test-verifier surfaces a flaky test, was it written by the frontend-builder or the backend-builder? Without per-agent attribution, model routing is a guess and prompt updates are theology.

Operational definition: for every PR that closes, log which subagent authored the artifact that the next gate rejected or accepted. After thirty PRs, you have a per-agent acceptance rate. After ninety, you can see which agent’s CLAUDE.md section needs sharpening, and which one is already pulling its weight. That is the loop Ullah’s chain does not yet close on its own.
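
One way to compute that acceptance rate, as a sketch rather than a prescribed schema: the column names (`pr`, `agent`, `gate`, `outcome`) and the sample rows are invented for illustration, and the log is assumed to hold one row per gate decision.

```python
import csv
import io
from collections import defaultdict


def acceptance_rates(rows):
    """Map each agent to the share of its artifacts the next gate accepted."""
    counts = defaultdict(lambda: [0, 0])  # agent -> [accepted, total]
    for row in rows:
        tally = counts[row["agent"]]
        tally[1] += 1
        if row["outcome"] == "accepted":
            tally[0] += 1
    return {agent: accepted / total for agent, (accepted, total) in counts.items()}


# Made-up sample rows; a real log would be read from the CSV file itself.
SAMPLE = """pr,agent,gate,outcome
101,backend-builder,implementation-validator,accepted
101,frontend-builder,implementation-validator,rejected
102,backend-builder,implementation-validator,accepted
102,frontend-builder,implementation-validator,accepted
"""

rates = acceptance_rates(csv.DictReader(io.StringIO(SAMPLE)))
for agent, rate in sorted(rates.items()):
    print(f"{agent}: {rate:.0%}")
```

With four rows the rates mean little. The point is that the calculation is a dozen lines, so the only real cost is logging the rows consistently.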

2. Approval-gate latency

Three human gates means three queues. Track the wall-clock time the human takes at each one: story review, spec review, PR review. If story review averages 90 seconds and PR review averages 45 minutes, the human bottleneck is at the wrong layer. The story gate is doing nothing; the PR gate is doing the work the earlier gates were supposed to absorb.

This is the metric that tells you whether the upstream artifacts (story, spec) are actually carrying their weight. A healthy factory has approval-gate latency that decreases as you move downstream, because each gate has already caught problems the next one would otherwise have to handle. An unhealthy factory has all three gates compressed into a single ninety-minute PR session, which means the story and spec rounds were ceremony.
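
The downstream check can be sketched in a few lines. The gate labels (`story`, `spec`, `pr`) and the `(gate, seconds)` record shape are our assumptions. The 90-second and 45-minute figures come from the example above, while the spec figure is invented to complete the sample.

```python
from statistics import mean

# Gates in the order work flows through them (labels are our own).
GATE_ORDER = ("story", "spec", "pr")


def mean_latency_by_gate(records):
    """records: iterable of (gate, seconds). Returns {gate: mean seconds}."""
    by_gate = {gate: [] for gate in GATE_ORDER}
    for gate, seconds in records:
        by_gate[gate].append(seconds)
    return {gate: mean(values) for gate, values in by_gate.items() if values}


def decreases_downstream(latency):
    """True if mean review time strictly shrinks from story to spec to PR."""
    ordered = [latency[gate] for gate in GATE_ORDER if gate in latency]
    return all(earlier > later for earlier, later in zip(ordered, ordered[1:]))


# 90 s story review and 45 min PR review, as in the example above;
# the 300 s spec review is a made-up value.
sample = [("story", 90), ("spec", 300), ("pr", 45 * 60)]
latency = mean_latency_by_gate(sample)
print(latency, "healthy" if decreases_downstream(latency) else "bottleneck at the wrong layer")
```

A strict comparison is a deliberate simplification. In practice you would likely compare weekly averages and tolerate some noise before calling the factory unhealthy.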

3. Restart frequency

Ullah names “throw the conversation away” as healthy. He gives no rate at which it becomes pathological. We propose one: count it.

Restart frequency is the number of sessions per week that get reset before producing a PR. If it sits at one or two per week, you are running a normal factory and clearing context noise the way Ullah recommends. If it climbs to one per day, your CLAUDE.md or your agent definitions are leaking, and what looks like productive work is actually a context-reload treadmill. The restart becomes the bug, not the fix.

Track it. Plot it weekly. A factory restarting daily is debugging a context problem, not building software.
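
Counting is the easy part. The sketch below groups restart dates by ISO week and applies the thresholds from the text: one or two per week is normal, one per day is a context problem. Two things here are our assumptions, not Ullah's: the "watch" band in between, and treating "one per day" as five restarts in a working week.

```python
from collections import Counter
from datetime import date


def restarts_per_week(restart_dates):
    """Count restarts per (ISO year, ISO week)."""
    weeks = Counter()
    for day in restart_dates:
        iso = day.isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(weeks)


def classify(count, workdays=5):
    """Label a weekly restart count; the 'watch' band is our own addition."""
    if count <= 2:
        return "normal"
    if count >= workdays:
        return "context problem"
    return "watch"


# Made-up dates: three restarts in one week, one in the next.
sample = [date(2026, 5, 4), date(2026, 5, 5), date(2026, 5, 6), date(2026, 5, 11)]
for week, count in sorted(restarts_per_week(sample).items()):
    print(week, count, classify(count))
```

How a restart gets recorded is left open here. A one-line manual entry each time you discard a session is enough to start.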

What This Costs to Add

Be honest with yourself before you reach for a vendor: none of the three numbers requires one.

A CSV next to your repo. A PostToolUse hook that timestamps each approval and writes one row per gate. A weekly fifteen-minute review of validator output grouped by which builder produced it. Two hours of operator discipline per week, on top of the two to three hours Ullah specified for standing up the factory in the first place. The marginal cost of instrumentation is a rounding error against the cost of a factory you cannot prove is improving.
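
The row-writing half of that hook might look like the sketch below. The field names and the `log_gate` helper are hypothetical. How the hook detects that an approval happened is not specified in the source and is left out here.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the gate log.
FIELDS = ("timestamp", "gate", "pr", "agent", "outcome")


def log_gate(csv_path, gate, pr, agent, outcome, now=None):
    """Append one timestamped row per gate decision; write a header if the file is new."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(
            {"timestamp": stamp, "gate": gate, "pr": pr, "agent": agent, "outcome": outcome}
        )
    return stamp
```

A call such as `log_gate("factory_log.csv", "spec", "102", "spec-writer", "accepted")` adds one line to the file. Timestamps are stored in UTC so that latency arithmetic does not depend on the reviewer's time zone.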

You can add a dashboard later. Start with the CSV.

Closing

The cleanest line from Ullah’s piece is the one that has already started showing up in other people’s posts: “The teams that get there first will not be the ones with the best AI tools. They will be the ones who built the cleanest factories around the AI tools they already have.”

That is exactly right, and it has one honest amendment. The cleanest factory wins only if you can prove it is getting cleaner. An instrumented factory is a learning factory. An uninstrumented one is a faster vibe coding loop with more files in it.

Ullah drew the assembly line. The scoreboard goes above it. Build the chain this month. Hang the numbers on the wall next month. By the third month you will know whether the centaur in your repo is actually getting faster, or whether you have just been busier in better-organized ways.

This analysis synthesizes “How to Build a Software Factory with Claude Code” (freeCodeCamp / Qudrat Ullah, May 2026) and “How to Unblock the AI PR Review Bottleneck” (freeCodeCamp / Qudrat Ullah, 2026).

Victorino Group helps engineering teams instrument their agent factories so the chain that ships code can prove it is getting better.