Engineering Notes

Harness Engineering Is Not New — But Naming It Matters

Thiago Victorino

Ryan Lopopolo at OpenAI published an article last week claiming to describe a new engineering discipline: harness engineering. The premise is that building reliable AI agent systems requires a distinct set of skills — designing AGENTS.md files, enforcing invariants through linters, managing entropy, treating the repository as the single source of truth.

Every one of those practices is older than the LLM era. What OpenAI is actually describing is the moment an industry discovers that its new tools need the same discipline as its old ones.

This is not a dismissal. There is something genuinely useful happening here. But the useful part is not where OpenAI thinks it is.

The Claim vs. the Evidence

OpenAI’s team built an internal product over five months with Codex agents. Three engineers, growing to seven. Roughly one million lines of code. About 1,500 pull requests. Zero manually written code — their claim.

Let me stop at that last assertion. “Zero manually written code” depends on a narrow definition that excludes AGENTS.md files, linter configurations, custom enforcement rules, prompt engineering, architectural decisions, and the entire harness itself. These are engineering artifacts. They require engineering skill. Calling them “not code” because they don’t compile is the same rhetorical trick as saying a building architect doesn’t build buildings because they don’t lay bricks.

The productivity numbers — 3.5 PRs per engineer per day — are presented without context. No quality metrics. No defect rates. No cost analysis. No comparison against a non-agent baseline. And lines of code as a measure of AI-generated output is meaningless. GitClear’s 2025 analysis found an 8x increase in duplicated code blocks in AI-assisted repositories. More lines is not more value.

This matters because OpenAI is selling Codex. This article is a case study marketing piece wearing the clothes of engineering insight. That does not invalidate the lessons. But it should calibrate how much weight you give the conclusions.

What They Got Right

Strip away the novelty claims and there are three genuine insights in the article.

First, the documentation architecture. OpenAI tried a single monolithic instruction file. It failed. They moved to a compact AGENTS.md (around 100 lines) pointing to a structured docs/ directory. This is not new — it is the docs-as-code movement applied to a new audience. But the specific observation that agents need navigational documentation, not encyclopedic documentation, is worth internalizing. An agent cannot search. It can only follow pointers.
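The compact, navigational style OpenAI landed on is easy to picture. Here is a hypothetical sketch of such an AGENTS.md; the file names, sections, and rules are illustrative assumptions, not OpenAI's actual file:

```markdown
# AGENTS.md — keep this short; it is a map, not a manual

## Where things live
- Architecture overview: docs/architecture.md
- Coding conventions and golden principles: docs/conventions.md
- How to run tests locally: docs/testing.md
- Release process: docs/release.md

## Hard rules (enforced by CI)
- Never edit generated files under src/gen/
- Every new module needs an entry in docs/architecture.md

## When stuck
- Follow the pointer above rather than guessing.
```

The point is that every line is either a pointer or an invariant; the encyclopedic detail lives in the docs/ files the agent is directed to.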

Second, entropy management. Agents replicate patterns, including bad ones. OpenAI initially spent 20% of engineering time — every Friday — cleaning up what they called “AI slop.” Their solution was to encode golden principles and run background garbage collection tasks that auto-refactor the codebase. This is the most honest admission in the piece: AI agents create entropy faster than humans do, and managing that entropy is a permanent cost, not a temporary one.
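OpenAI's cleanup tooling is not public, but the "garbage collection" idea can be approximated with a simple duplication scanner. A minimal sketch, assuming a set of source files loaded as strings, that flags blocks an agent may have replicated verbatim (the window size is an arbitrary choice, not OpenAI's):

```python
import hashlib
from collections import defaultdict

WINDOW = 6  # lines per block to compare; arbitrary choice


def normalized_windows(text: str):
    """Yield (start_line, hash) for each WINDOW-line block, ignoring leading/trailing whitespace."""
    lines = [ln.strip() for ln in text.splitlines()]
    for i in range(len(lines) - WINDOW + 1):
        block = "\n".join(lines[i:i + WINDOW])
        if block.strip():  # skip all-blank windows
            yield i + 1, hashlib.sha1(block.encode()).hexdigest()


def find_duplicates(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map block-hash -> locations, keeping only blocks that appear in two or more places."""
    seen = defaultdict(list)
    for name, text in files.items():
        for line_no, digest in normalized_windows(text):
            seen[digest].append((name, line_no))
    return {h: locs for h, locs in seen.items() if len(locs) > 1}
```

A background task could run a scanner like this on a schedule and open cleanup work for the worst offenders, which is the shape of the "garbage collection" loop the article describes.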

Third, the inversion of Brooks’s Law. In a traditional team, adding people adds communication overhead. OpenAI claims that with a sufficient harness, adding engineers increased throughput proportionally because the harness replaces human-to-human communication with human-to-system communication. This is a testable hypothesis and, if true, the single most consequential claim in the article. It deserves rigorous independent verification before anyone builds a staffing strategy around it.

What They Got Wrong

The framing. Harness engineering is not a new discipline. It is a convergence discipline — a name for the integration of practices that already existed separately.

Consider the components:

  • Structured documentation for automated consumers: Infrastructure as Code, runbooks, API specs. Decades old.
  • Enforcing invariants through linters and structural tests: Static analysis, CI/CD pipelines, pre-commit hooks. Decades old.
  • Repository as single source of truth: GitOps, configuration management, docs-as-code. At least fifteen years old.
  • Custom linting with remediation guidance: ESLint with fix suggestions, RuboCop autocorrect, compiler error messages with hints. Not new.
  • Background automated cleanup: Scheduled refactoring, dependency updates, code formatting enforcement. Not new.
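The "remediation guidance" pattern in the list above is straightforward to reproduce. A minimal sketch of a custom lint check whose diagnostics tell the agent (or human) exactly how to fix each violation; the rules themselves (no bare print in library code, no bare except) are illustrative inventions, not OpenAI's:

```python
import re

# Each rule pairs a detector with an explicit remediation instruction,
# so the error message doubles as a fix prompt for an agent.
RULES = [
    (re.compile(r"^\s*print\("),
     "bare print() in library code",
     "replace with logger.info(...) from the module-level logger"),
    (re.compile(r"except\s*:"),
     "bare except clause",
     "catch a specific exception type, e.g. `except ValueError:`"),
]


def lint(filename: str, source: str) -> list[str]:
    """Return diagnostics that include both the problem and how to fix it."""
    diagnostics = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for pattern, problem, fix in RULES:
            if pattern.search(line):
                diagnostics.append(
                    f"{filename}:{line_no}: {problem} -> fix: {fix}"
                )
    return diagnostics
```

The design choice worth copying is that the message carries the remediation, not just the complaint, so an agent reading CI output can act on it without a round trip to a human.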

What is genuinely different is the target audience. These practices were designed for human developers. Adapting them for LLM agents requires real engineering judgment — different constraints, different failure modes, different feedback mechanisms. But adaptation is not invention.

The DevOps Analogy

Harness engineering is to the agentic era what DevOps was to the cloud era.

DevOps did not invent continuous integration, automated testing, infrastructure automation, or monitoring. Those practices existed. What DevOps did was name the convergence — the recognition that development and operations needed to be a single discipline rather than two teams throwing artifacts over a wall.

The naming mattered enormously. Once “DevOps” existed as a concept, organizations could hire for it, budget for it, build teams around it, and measure it. The name created organizational infrastructure that the individual practices alone could not.

Harness engineering is following the same trajectory. The practices exist. What is arriving is the name, and with it, the organizational recognition that someone needs to own the system that agents operate within.

This is why the naming matters more than the novelty. You do not need harness engineering to be new for it to be important. You need it to be named so that organizations can invest in it deliberately rather than discovering it accidentally.

Independent Convergence as Validation

The strongest evidence that something real is happening here is that multiple organizations arrived at the same patterns independently.

Anthropic published “Effective Harnesses for Long-Running Agents” in November 2025. Their recommended pattern — an Initializer that boots the agent with session context, followed by a Coding Agent that executes — mirrors OpenAI’s architecture without sharing its lineage. Same conclusion, different team, different product.

AGENTS.md, the documentation format OpenAI describes, has been adopted by over 60,000 projects and is now under the Linux Foundation’s AI Agent Infrastructure Foundation (AAIF) as of December 2025. This is not one company’s convention. It is an emerging standard.

Manus, the AI agent startup, rewrote their harness five times in six months. Not because they were incompetent, but because the constraints of agent orchestration are genuinely different from traditional software orchestration and there are no established best practices yet.

When three companies independently reach the same architectural conclusions, that is not marketing. That is convergent evolution driven by shared constraints. The practices are validated. The novelty is overstated.

The Entropy Problem Is the Real Story

If you take one idea from OpenAI’s article, take the entropy problem.

Traditional software has a natural drag on entropy. Human developers resist repetition because it bores them. They refactor because duplication feels wrong. They push back on bad patterns because their professional identity is tied to code quality.

AI agents have no such resistance. An agent will replicate a bad pattern a thousand times with the same enthusiasm as a good one. It will not notice that three modules use slightly different logging formats. It will not feel uneasy about a file that has grown too large. Entropy is not a bug in agent-generated code. It is the default state.

This means that any team operating AI agents at scale needs a permanent entropy management practice. Not a one-time cleanup. Not a quarterly refactoring sprint. A continuous, automated, systematic process for detecting and correcting drift.
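The logging example above is easy to turn into an automated drift check. A minimal sketch, assuming modules configure logging with a `format=` string and that the canonical format below is an arbitrary stand-in chosen for illustration:

```python
import re

# The team's single agreed-upon logging format; an illustrative assumption.
CANONICAL_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"

# Matches format="..." or format='...' in source text.
FORMAT_RE = re.compile(r"""format\s*=\s*["']([^"']+)["']""")


def drift_report(modules: dict[str, str]) -> list[str]:
    """Flag modules whose logging format string deviates from the canonical one."""
    findings = []
    for name, source in modules.items():
        for match in FORMAT_RE.finditer(source):
            if match.group(1) != CANONICAL_FORMAT:
                findings.append(
                    f"{name}: logging format {match.group(1)!r} "
                    f"differs from canonical {CANONICAL_FORMAT!r}"
                )
    return findings
```

Run continuously in CI, a check like this catches the "slightly different logging formats" drift the moment an agent introduces it, rather than months later.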

OpenAI’s “garbage collection” metaphor is apt. In languages with garbage collection, memory management is not optional and it is not a one-time task. It runs continuously in the background. Entropy management for AI-generated code is the same kind of permanent overhead.

The organizations that understand this will budget for it. The ones that do not will accumulate technical debt at a rate they have never experienced before. LangChain’s 2025 survey found that 32% of organizations cite quality as their top barrier to scaling AI agents. This is the entropy problem wearing a business suit.

What This Means for Teams Adopting AI Agents

Do not wait for the discipline to mature before starting. The practices are known even if the name is new. If you are using AI agents in development, you need documentation designed for agents, not humans. You need linters that enforce invariants with remediation instructions in the error messages. You need automated entropy detection. You need these things now, not after the industry agrees on what to call them.

Treat the harness as a first-class engineering artifact. Your AGENTS.md, your linter rules, your CI enforcement, your cleanup automation — these are not overhead. They are the product. In an agent-first workflow, the harness determines output quality more than the model does.

Be skeptical of productivity claims that lack quality metrics. 3.5 PRs per engineer per day means nothing without defect rates, rework rates, and long-term maintenance costs. Gartner predicts 40% of agentic AI projects will be canceled by 2027. The ones that survive will be the ones that measured quality, not just velocity.

Budget for entropy management as a permanent line item. If 20% of engineering time is the steady-state cost of managing AI-generated code quality — and OpenAI’s experience suggests it might be — that needs to be in your capacity planning. It is not a sign of failure. It is the cost of operating agents at scale.

Look at Spotify, not just OpenAI. Spotify reports roughly 50% full automation on over 1,500 merged AI pull requests, with documented failure modes and honest assessment of limitations. That is a more useful benchmark than a vendor selling its own product.

The Discipline Exists. The Investment Does Not.

Harness engineering is real. It is a genuine set of skills, practices, and architectural decisions that determine whether AI agents produce value or produce mess. But it is not new in the way OpenAI implies. It is the recognition that environment design, invariant enforcement, documentation architecture, and continuous quality management — practices engineers have refined for decades — need to be applied deliberately, systematically, and permanently to a new class of consumer: the LLM agent.

The gap is not in knowledge. Engineers know how to write linters, structure documentation, enforce CI gates, and automate refactoring. The gap is in organizational recognition — in treating these practices as a named, funded, staffed discipline rather than something individual engineers figure out on their own.

That is why the naming matters. Not because the ideas are new. Because the investment is overdue.


Sources

  • Ryan Lopopolo. “Harness engineering: leveraging Codex in an agent-first world.” OpenAI, February 2026.
  • Anthropic. “Effective Harnesses for Long-Running Agents.” anthropic.com, November 2025.
  • Linux Foundation. AI Agent Infrastructure Foundation (AAIF) announcement. December 2025.
  • GitClear. “AI Code Quality in 2025: Measuring the Impact of Copilot and Beyond.” gitclear.com, 2025.
  • LangChain. “State of AI Agents 2025.” langchain.dev, 2025.
  • Gartner. “Agentic AI project cancellation forecast.” gartner.com, 2025.
  • Phil Schmid / Aakash Gupta on Manus harness rewrites. 2025.
  • Spotify Engineering. AI-assisted PR automation metrics. 2025.

At Victorino Group, we build the governance layer that turns AI agents from experiments into infrastructure — harness engineering included. If your agents produce output but not confidence, let’s talk.
