Generator-Evaluator Loops: What Anthropic's Harness Design Actually Teaches
Anthropic published a new harness design for long-running applications. Three agents. A planner, a generator, an evaluator. Built on the Claude Agent SDK. Two case studies: a retro game maker and a digital audio workstation. Cost figures included.
The article contains exactly one breakthrough insight, two useful engineering patterns, and a misleading analogy that needs correcting. Let me separate them.
The GAN Analogy Is Wrong
Prithvi Rajasekaran frames the generator-evaluator architecture as “inspired by GANs.” This framing is structurally incorrect and worth addressing because it sets wrong expectations about how these systems learn.
In a GAN, the generator and discriminator are jointly trained. Gradients flow from the discriminator back to the generator. The discriminator improves at detecting fakes; the generator improves at producing them. There is convergence theory (fragile, but real). There is a mathematical game being played.
None of that happens here.
In Anthropic’s architecture, the generator produces code. The evaluator runs Playwright tests and reports pass/fail. The generator reads the report and tries again. There is no gradient flow. No weight updates. No adversarial training dynamics. The “improvement” comes from the generator receiving new context (the failure report) and trying a different approach within the same context window. When the context window fills up, you reset and start over.
This is iterative generate-and-test with role separation. A useful pattern. But calling it GAN-inspired is like calling a spell-checker “inspired by neural machine translation” because both involve text correction. The mechanism is entirely different.
Why does this matter? Because the GAN framing implies emergent improvement. It suggests the system gets better over time through some training dynamic. It does not. Each attempt is an independent sample from the model’s distribution, conditioned on the accumulated context. The evaluator does not train the generator. It provides information that the generator uses within a single session.
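Stripped to its mechanism, the loop the article describes is easy to sketch. Here is a toy version in Python; the class names, retry budget, and pass criterion are all illustrative, not the SDK's API:

```python
from dataclasses import dataclass

@dataclass
class Report:
    passed: bool
    detail: str

class ToyGenerator:
    """Stand-in for the LLM generator: each call is an independent
    sample conditioned on the accumulated context. No weight updates."""
    def generate(self, context):
        # A real generator would prompt a model with the contract plus
        # prior failure reports; here we just count prior failures.
        failures = sum(1 for item in context if isinstance(item, Report))
        return f"attempt-{failures}"

class ToyEvaluator:
    """Stand-in for the evaluator: runs tests and reports pass/fail.
    It never trains the generator; it only produces information."""
    def run_tests(self, code):
        passed = code == "attempt-2"  # toy criterion: third attempt passes
        return Report(passed, f"{code}: {'pass' if passed else 'fail'}")

def run_sprint(generator, evaluator, contract, max_attempts=5):
    context = [contract]                # working context for this sprint
    for _ in range(max_attempts):
        code = generator.generate(context)
        report = evaluator.run_tests(code)
        if report.passed:
            return code, context
        context.append(report)          # feedback = new context, not gradients
    return None, context                # budget spent: reset, start over

result, ctx = run_sprint(ToyGenerator(), ToyEvaluator(), "contract: build game")
```

The only state that carries between attempts is the context list. Delete it and the generator is exactly where it started, which is the whole point: nothing here resembles training.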
As we examined in The Architecture of Multi-Agent Systems, adversarial feedback loops (like the Grandpa Loop’s review agents) provide real quality improvement. But they work through structured review and backpressure, not through any analogy to adversarial training. Calling every feedback loop a “GAN” obscures the actual mechanism.
Sprint Contracts Are ATDD for Agents
The second pattern is more useful. Anthropic introduces “sprint contracts”: before each implementation sprint, the generator proposes scope and explicit success criteria. The evaluator validates these criteria before any code is written. Each sprint has hard pass/fail thresholds.
If you have written acceptance tests before deployment, this will sound familiar. Acceptance Test-Driven Development (ATDD) and Behavior-Driven Development (BDD) have used this pattern for over twenty years. Write the acceptance criteria first. Implement until they pass. The criteria define “done.”
The application to AI agent workflows is the contribution. When a human developer writes acceptance criteria, the implementation happens in a mind that understands the criteria implicitly. When an AI agent implements against acceptance criteria, the criteria must be explicit, testable, and machine-readable. The sprint contract forces that translation.
There is a subtlety here that Anthropic does not emphasize enough. The planner agent writes specifications “without over-specifying.” This is the hard part. Over-specified plans constrain the generator into suboptimal paths. Under-specified plans let it wander. The planner must describe outcomes without prescribing implementation. That is a genuine engineering challenge, and it maps directly to how we think about agent orchestration in production: the orchestration layer describes what must happen without dictating how each agent achieves it.
Sprint contracts also create natural checkpoints for context management. Each sprint boundary is an opportunity to compress history, discard irrelevant context, and reset the working memory. This is not accidental. It is the most practical feature of the pattern.
Context Decay Is Real, and Longer Windows Do Not Fix It
The most independently verifiable claim in the article: models exhibit what Rajasekaran calls “context anxiety.” As the context window fills, the model begins prematurely wrapping up tasks. It rushes to conclusions. It simplifies. It cuts corners.
This matches independent research. Morphllm.com attributes 65% of enterprise AI failures to context drift. Multiple teams have reported similar degradation patterns. The finding is real.
Anthropic’s solution is structured context resets rather than simple context compaction. Instead of summarizing the full conversation into a shorter version (which loses critical details), the harness creates clean sprint boundaries where accumulated context is archived and the model starts fresh with only the current sprint’s contract and relevant prior results.
The implication runs counter to the marketing narrative around larger context windows. A million-token context window does not prevent decay. It delays it. The model still degrades as context accumulates. More room means more time before degradation becomes visible, but the fundamental problem remains: models process recent context more effectively than distant context, and accumulated noise compounds.
This has a practical consequence for anyone building long-running agent systems. You need a context management strategy. Not “use a bigger window.” A strategy. As we documented in The Harness Difference, LangChain’s Terminal Bench found that local context middleware (selectively surfacing relevant portions instead of dumping everything) produced a 13.7-point improvement. Context discipline beats context capacity.
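A structured reset, as opposed to compaction, might look like the following; the carry-forward flag and record shape are assumptions for illustration, not the harness's actual data model:

```python
def reset_context(history, next_contract, keep=lambda item: item.get("carry")):
    """Structured reset at a sprint boundary (hypothetical shape):
    archive everything, then start fresh with only the new contract
    and the few prior results explicitly marked as carry-forward."""
    archive = list(history)  # full history preserved, not summarized away
    carried = [item for item in history if keep(item)]
    fresh = [{"role": "contract", "text": next_contract}] + carried
    return fresh, archive

history = [
    {"role": "log", "text": "attempt 1 failed", "carry": False},
    {"role": "result", "text": "level format v2 chosen", "carry": True},
    {"role": "log", "text": "attempt 2 passed", "carry": False},
]
fresh, archive = reset_context(history, "Sprint 2: add audio")
```

The contrast with compaction is that nothing gets lossily summarized: decisions marked as load-bearing survive verbatim, and everything else is archived rather than squeezed into a summary that may drop the one detail that mattered.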
The Cost Data: Interesting but Not Evidence
Rajasekaran shares cost figures from the two case studies. The Retro Game Maker cost $9 over 20 minutes with a simple harness and produced a broken game. With the full three-agent harness, it cost $200 over six hours and produced a functional game. The DAW project cost $124.70 over three hours and 50 minutes with a simplified harness approach.
A 22x cost increase for a qualitative leap from “broken” to “functional.” That ratio is interesting as a single observation. It is not a benchmark.
These are individual runs. No variance analysis. No comparison against simpler alternatives (what would a single agent with better prompting achieve?). No repeated trials to establish reliability. The retro game went from broken to functional, but we do not know whether the $200 run was typical, lucky, or one of many attempts.
This is the challenge with Anthropic writing about Anthropic products. The source selection bias is structural. The team chose which projects to highlight and which results to share. The failures, the mediocre runs, the cases where the harness added cost without adding quality: those do not appear in a blog post. This does not mean the results are fabricated. It means they are curated.
For teams evaluating whether to invest in multi-agent harnesses, the cost data suggests a useful heuristic: expect an order-of-magnitude cost increase for complex harnesses. Whether the quality improvement justifies that cost depends entirely on your use case. A $200 game prototype is cheap. A $200-per-run production pipeline is expensive. Context matters more than the number.
“Harness Complexity Moves with the Frontier”
The article’s most thought-provoking observation: as models improve, the harness complexity required for previously difficult tasks decreases. But the frontier of what you attempt moves forward, and frontier tasks require new harness complexity. “The space of interesting harness combinations doesn’t shrink. It moves.”
This aligns with what we observed in Harness Engineering Is Not New: Manus rewrote their harness five times in six months not because the harness was bad, but because the models improved and the harness had to evolve. The harness is a living system that co-evolves with the models it wraps.
Rajasekaran’s framing adds a useful dimension. The total investment in harness engineering does not decrease as models improve. It redistributes. Simple tasks need less scaffolding. But the tasks you tackle become harder, and the harder tasks demand novel scaffolding.
This borders on tautology (harder things are harder), but the practical implication is not obvious. Teams that budget for harness engineering as a one-time cost will be surprised when the bill recurs. The investment is permanent. What changes is where you spend it.
Self-Evaluation Bias: The Confirmed Finding
One finding from the article has strong independent verification: agents consistently overrate their own output. Rajasekaran notes that the external evaluator catches defects the generator misses. This matches Anthropic’s own “Building Effective Agents” guidance and independent research from Chroma.
But there is a limitation Rajasekaran acknowledges: the evaluator shares the same model’s blindspots. An evaluator built on the same model as the generator will miss the same categories of errors. The solution (calibrating the evaluator with few-shot examples of common failure modes) is essentially manual injection of human judgment into the loop. This works, but it means the system’s ceiling is bounded by the quality of those examples.
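One way to picture that calibration step: seed the evaluator's prompt with concrete failure-mode examples. The helper, the examples, and the prompt layout below are all hypothetical:

```python
FAILURE_MODE_EXAMPLES = [
    # Hypothetical few-shot examples of defects the base model tends to miss.
    # This list is the manual injection of human judgment the pattern relies on.
    {"symptom": "tests pass but the screen is blank", "verdict": "fail",
     "reason": "assertions checked the DOM, not the rendered canvas"},
    {"symptom": "audio plays at double speed", "verdict": "fail",
     "reason": "sample rate hardcoded to 22050 Hz"},
]

def build_evaluator_prompt(contract, artifact_summary):
    """Assemble an evaluator prompt seeded with known failure modes.
    The system's ceiling is bounded by the quality of these examples."""
    shots = "\n".join(
        f"- {ex['symptom']} -> {ex['verdict']} ({ex['reason']})"
        for ex in FAILURE_MODE_EXAMPLES
    )
    return (
        f"Contract:\n{contract}\n\n"
        f"Known failure modes to check explicitly:\n{shots}\n\n"
        f"Artifact under review:\n{artifact_summary}\n"
        "Return pass/fail per criterion with evidence."
    )

prompt = build_evaluator_prompt("Sprint 1: playable title screen",
                                "build log + screenshot digest")
```

Note what this does and does not buy: the evaluator now checks for failure classes a human has already seen, but a defect category absent from the examples remains in the shared blind spot.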
The honest conclusion: generator-evaluator separation improves quality over self-evaluation. It does not eliminate evaluation bias. The evaluator is better than self-grading but worse than human review. For production systems, this means the evaluator reduces the volume of human review needed. It does not replace it.
What Teams Should Take from This
Strip away the GAN analogy and the single-run cost figures, and three patterns survive scrutiny.
Sprint contracts work for agent orchestration. Explicit, machine-readable success criteria at each implementation boundary. This is not new to software engineering, but applying it to AI agent workflows is a genuine contribution. If your agents run multi-step workflows, define acceptance criteria per step. Not as guidelines. As hard pass/fail gates.
Context resets beat context expansion. Invest in structured context management rather than relying on larger windows. Sprint boundaries are natural reset points. Archive what matters, discard what does not, start each phase with a clean working set.
Separate generation from evaluation, but calibrate the evaluator. The generator-evaluator pattern produces better results than self-evaluation. But an uncalibrated evaluator is only marginally better. The value comes from injecting specific, concrete examples of what “good” and “bad” look like for your domain.
None of these patterns require the Claude Agent SDK. None require three agents. The principles apply to any multi-step AI system: define done before starting, manage context deliberately, and never let an agent grade its own homework without a second opinion.
This analysis synthesizes Harness Design for Long-Running Application Development (Anthropic, March 2026), Building Effective Agents (Anthropic, 2025), and LangChain Terminal Bench 2.0 (LangChain, February 2026).
Victorino Group designs the harness architecture that turns long-running AI agents into reliable production systems. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.