Operating AI

Agent Teams and the Shift from Writing Code to Directing Work

Thiago Victorino

A researcher at Anthropic pointed 16 instances of Claude Opus 4.6 at a problem: build a C compiler in Rust. Two weeks later, the result was a 100,000-line codebase that compiles the Linux kernel across three architectures, passes 99% of the GCC torture test suite, and handles production software including QEMU, FFmpeg, SQLite, PostgreSQL, and Redis.

The cost was roughly $20,000.

This is not a story about a powerful model. It is a story about what happens when you stop using AI as a typing assistant and start using it as a workforce.

The Environment Is the Product

Nicholas Carlini, the researcher who ran the project, published a detailed account of what worked and what didn’t. His most important observation was not about the model. It was about the work that surrounded the model.

Most of his effort went into designing the environment: the test infrastructure, the feedback loops, the task boundaries. Not the prompts. Not the model configuration. The environment.

This inverts the conventional wisdom about AI development. The industry obsesses over prompt engineering. Carlini’s experience suggests that environment engineering --- the design of constraints, tests, and feedback mechanisms that shape autonomous work --- matters more.

The distinction is important. A prompt tells an agent what to do. An environment tells an agent how to know whether it succeeded. When agents operate autonomously for hours, the environment is the only thing maintaining quality. No human is watching each decision. The test suite is.

Test Suites as Specifications

Here is the non-obvious insight: when agents are autonomous, your tests literally define what gets built.

Carlini used what he calls the “compiler oracle” technique. He didn’t write a specification document describing how the C compiler should behave. He pointed the agents at an existing C compiler (GCC) and said: produce the same output for every input. The oracle --- the existing compiler --- became the specification. The test suite became the contract.

This changes the testing conversation entirely. In traditional development, tests verify that code meets a spec. In autonomous development, tests ARE the spec. There is no other document. The agents read the tests, write code, run the tests, and iterate until they pass.
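The oracle loop can be sketched in a few lines. This is an illustrative reconstruction of the differential-testing idea, not Carlini's actual harness: `./mycc` is a hypothetical path for the candidate compiler, and the comparison here checks only stdout and exit status.

```python
import os
import subprocess
import tempfile

def run_binary(path):
    """Run a compiled binary and capture the behavior the oracle compares."""
    p = subprocess.run([path], capture_output=True, timeout=10)
    return (p.stdout, p.returncode)

def behaviors_match(ref, cand):
    """The oracle's verdict: identical stdout and identical exit status."""
    return ref == cand

def differential_test(source_path, candidate_cc="./mycc"):
    """Compile one C test case with gcc (the oracle) and with the candidate
    compiler, run both binaries, and compare observable behavior."""
    with tempfile.TemporaryDirectory() as tmp:
        ref_bin = os.path.join(tmp, "ref")
        cand_bin = os.path.join(tmp, "cand")
        subprocess.run(["gcc", source_path, "-o", ref_bin], check=True)
        subprocess.run([candidate_cc, source_path, "-o", cand_bin], check=True)
        return behaviors_match(run_binary(ref_bin), run_binary(cand_bin))
```

Any C program the agents can feed through both compilers becomes a test case, which is why the existing compiler can stand in for a written specification.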

The implication for organizations: if your test suite is incomplete, ambiguous, or poorly maintained, autonomous agents will build software that matches your tests --- not your intentions. Bad tests don’t just miss bugs. They actively encode the wrong behavior.

Every organization that plans to use autonomous AI agents needs to reckon with this. The quality of your test infrastructure is no longer a nice-to-have engineering practice. It is the primary input to what your software becomes.

From Pair Programming to Project Management

Claude Code’s new Agent Teams feature makes multi-agent coordination a first-class capability. Instead of a single AI assistant helping you write code, you can now run a team: a lead agent that coordinates work, teammate agents that execute independently, a shared task list, and a message system for coordination.

Each teammate has its own context window. Each operates independently. They can message each other directly. The lead delegates, reviews, and coordinates.

This is architecturally different from subagents, where a parent spawns a child that reports back results. Agent teams collaborate. They claim tasks, flag blockers, request reviews from peers, and coordinate across workstreams. The metaphor is not a function calling a subroutine. The metaphor is a project manager directing a team.
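The project-manager metaphor can be made concrete with an in-process analogy: a shared task list that teammates claim from, plus a message channel back to the lead. This is an illustrative sketch only, not the Agent Teams API; the task names and agent IDs are invented.

```python
import queue
import threading

tasks = queue.Queue()      # shared task list
messages = queue.Queue()   # teammates -> lead

def teammate(name):
    """A teammate loops: claim a task, do it, report back, repeat."""
    while True:
        try:
            task = tasks.get_nowait()  # claiming is atomic: one winner per task
        except queue.Empty:
            return
        # ... the real work for `task` would happen here ...
        messages.put((name, task, "done"))
        tasks.task_done()

for t in ["parser", "codegen", "tests"]:
    tasks.put(t)

workers = [threading.Thread(target=teammate, args=(f"agent-{i}",)) for i in range(2)]
for w in workers:
    w.start()
tasks.join()  # the lead waits until every task has been claimed and finished

results = [messages.get_nowait() for _ in range(3)]
```

Note what the lead does not do: it never assigns tasks directly. Teammates pull work, which is the structural difference from a parent spawning a subagent and waiting on its return value.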

For engineers, this changes the nature of the work. The skill is no longer writing code alongside an AI. The skill is defining work packages, setting acceptance criteria, designing review processes, and resolving conflicts between autonomous workers. The role shifts from pair programmer to engineering manager --- except the team works at machine speed and the manager needs to keep up.

The Coordination Problem

Carlini’s 16-agent compiler project surfaced a problem that every multi-agent deployment will face: coordination at scale.

Sixteen agents working on the same codebase means sixteen agents that can create merge conflicts, duplicate work, make contradictory design decisions, and break each other’s code. Carlini needed locking mechanisms so agents wouldn’t edit the same files simultaneously. He needed task-claiming protocols so two agents wouldn’t solve the same problem. He needed merge conflict handling that didn’t require human intervention at 3am.
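A task-claiming protocol of the kind described can be as simple as exclusive file creation, which the filesystem guarantees is atomic. This is a common pattern sketched as an assumption, not Carlini's actual mechanism; the directory name and IDs are invented.

```python
import os

def try_claim(task_id, agent_id, claim_dir="claims"):
    """Atomically claim a task by exclusively creating a claim file.
    O_CREAT | O_EXCL fails if the file already exists, so when several
    agents race for the same task, exactly one of them wins."""
    os.makedirs(claim_dir, exist_ok=True)
    path = os.path.join(claim_dir, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already claimed this task
    os.write(fd, agent_id.encode())  # record who owns the task
    os.close(fd)
    return True
```

The same primitive works for file locks: agents take a lock before editing a shared file and release it by deleting the lock file, with no human arbitration required.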

This is governance infrastructure. Not the abstract kind that lives in policy documents. The operational kind that determines whether sixteen agents produce a coherent system or sixteen incompatible fragments.

The Agent Teams feature in Claude Code addresses some of this with built-in task management, delegation, and plan approval. But the fundamental challenge remains: autonomous agents need rules, boundaries, and coordination mechanisms. The more agents you run, the more governance you need. This scales at least linearly with agent count --- and likely faster, since the number of pairwise interactions between agents grows with the square of the team size.

Organizations that struggle to coordinate human engineering teams should think carefully about whether adding AI agents to the mix simplifies or compounds the problem. The answer depends entirely on the governance infrastructure they build first.

The Economics of Autonomous Development

The $20,000 question deserves a direct answer.

A 100,000-line compiler that passes 99% of a standard test suite, built in two weeks: what would this cost with human developers? A conservative estimate for a C compiler of this scope is six to twelve months of work for a team of four to six senior systems programmers. At market rates for compiler engineers, that is $500,000 to $1,500,000 in salary and overhead.

The cost reduction is between 25x and 75x.

But the economics are more nuanced than a ratio. The $20,000 figure reflects roughly 2 billion input tokens and 140 million output tokens across 2,000 sessions. Opus 4.6 is priced at $5 per million input tokens and $25 per million output tokens. If those prices drop --- and they will --- the economics only improve.
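The arithmetic is easy to check. The calculation below uses the list prices and token counts quoted above; note that the raw token cost comes out below the headline figure, so the reported ~$20,000 presumably also covers overheads the source does not itemize.

```python
# Token counts and list prices as quoted in the article.
input_tokens = 2_000_000_000
output_tokens = 140_000_000
price_in = 5 / 1_000_000    # $ per input token
price_out = 25 / 1_000_000  # $ per output token

raw_cost = input_tokens * price_in + output_tokens * price_out
# -> $13,500 at list price: a floor, below the reported ~$20,000 total.

# The 25x-75x reduction claim against the human-team estimate:
ratio_low = 500_000 / 20_000     # 25.0
ratio_high = 1_500_000 / 20_000  # 75.0
```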

The more interesting economic question is not “is this cheaper?” but “what becomes economically viable?” A company that would never fund a twelve-month compiler project might fund a two-week, $20,000 experiment. The reduction in cost doesn’t just make existing projects cheaper. It makes previously impossible projects feasible. This is the expansion effect that Karpathy and Willison have both described --- and Carlini’s project is the most dramatic demonstration of it to date.

What Opus 4.6 Changes

The compiler project ran on Opus 4.6, released February 5, 2026. The model deserves attention not for its benchmark scores --- though it leads on Terminal-Bench 2.0 for agentic coding and outperforms GPT-5.2 by roughly 144 Elo on GDPval-AA --- but for three capabilities that matter for autonomous work.

Adaptive thinking. The model adjusts its reasoning depth based on task complexity. Simple tasks get fast responses. Complex architectural decisions get deeper analysis. This matters when agents run for hours: wasting reasoning on trivial decisions burns tokens and time. Effort controls let developers tune this explicitly.

One million token context window. Currently in beta, this allows agents to hold significantly more of a codebase in context at once. For a compiler project spanning hundreds of files, the difference between 200K and 1M tokens of context is the difference between an agent that understands a module and an agent that understands the system.

Context compaction. Also in beta, this feature lets the model compress its context during long sessions without losing critical information. For multi-hour autonomous runs, this addresses the practical problem of context exhaustion --- the point where an agent’s context window fills up and it starts losing track of earlier decisions.
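The idea behind compaction can be illustrated with a naive sketch. This is not Anthropic's algorithm --- just an assumed simple strategy: when the transcript exceeds a token budget, collapse the oldest messages into one summary entry and keep the recent tail verbatim.

```python
def compact(history, budget, keep_tail=4):
    """Compress a transcript once it exceeds a token budget.
    `history` is a list of message strings; the token count is a crude
    proxy (1 token ~ 4 characters). In a real system, a model would
    write the summary; here it is a placeholder string."""
    def tokens(msgs):
        return sum(len(m) for m in msgs) // 4

    if tokens(history) <= budget or len(history) <= keep_tail:
        return history  # under budget: nothing to do

    old, tail = history[:-keep_tail], history[-keep_tail:]
    summary = f"[summary of {len(old)} earlier messages]"
    return [summary] + tail
```

The practical point is the asymmetry: early decisions survive only in summarized form, while recent context stays intact --- which is why compaction quality matters most for the architectural decisions made hours earlier.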

These are not flashy features. They are infrastructure features. They make sustained autonomous work practical rather than theoretical.

The Specialization Emergence

One of Carlini’s most interesting observations: the agents developed informal specialization without being told to.

He started with general-purpose agents. Over the course of the project, patterns emerged. Some agents became better at certain subsystems. Some developed effective strategies for particular types of problems. The specialization wasn’t programmed. It emerged from the interaction between agent capabilities, task assignment, and the feedback from tests.

This echoes what we see in human organizations. Put a team on a problem, and people naturally gravitate toward areas where they’re effective. The mechanism is different with AI --- there’s no preference or enjoyment, just differential effectiveness --- but the outcome is similar.

The practical lesson: don’t over-specify agent roles upfront. Define the work. Define the acceptance criteria. Let the coordination mechanism handle assignment. Rigid role definitions may prevent the emergent specialization that makes multi-agent systems effective.
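One way such specialization can emerge from assignment rather than role definitions is to route each task category to whichever agent has the best observed success rate, with occasional exploration. This is an assumed sketch of the general idea, not the project's actual mechanism; agent and category names are invented.

```python
import random
from collections import defaultdict

class Router:
    """Assign tasks by observed effectiveness, not predefined roles."""

    def __init__(self, agents, explore=0.1):
        self.agents = agents
        self.explore = explore           # fraction of random assignments
        self.wins = defaultdict(int)     # (agent, category) -> successes
        self.tries = defaultdict(int)    # (agent, category) -> attempts

    def assign(self, category):
        if random.random() < self.explore:
            return random.choice(self.agents)  # keep exploring occasionally

        def rate(agent):
            t = self.tries[(agent, category)]
            # Optimistic prior for untried pairs so new agents get a chance.
            return self.wins[(agent, category)] / t if t else 0.5

        return max(self.agents, key=rate)

    def record(self, agent, category, success):
        self.tries[(agent, category)] += 1
        self.wins[(agent, category)] += int(success)
```

After a few dozen tasks, agents that happen to be effective on a subsystem keep getting routed its work --- specialization as an outcome of feedback, not of upfront design.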

What This Means For Your Organization

Invest in test infrastructure before investing in agents. If your tests are incomplete, agents will build software that matches your incomplete tests. The test suite is the specification. Treat it as such.

Start thinking about environment design. The most impactful skill for autonomous AI development is not prompt engineering. It is designing the constraints, feedback loops, and quality gates that shape agent behavior. This is a governance discipline, not a coding technique.

Rethink what “engineering skill” means. The value of an engineer is shifting from the ability to write code to the ability to define work, review output, and maintain architectural coherence. This has implications for hiring, training, and performance evaluation.

Build coordination infrastructure before scaling agent count. Sixteen agents without governance produce chaos. Two agents with strong coordination mechanisms produce reliable software. The infrastructure comes first.

Run the economics on projects you’ve deferred. The cost structure of autonomous development makes a category of projects viable that weren’t before. The question is not whether AI development is cheaper. The question is what you can now build that you couldn’t justify before.

The compiler project is a proof point, not an outlier. The tools are available. The economics work. The question --- as always --- is whether your organization has the governance infrastructure to use them responsibly. Autonomous agents don’t reduce the need for engineering discipline. They amplify the consequences of its absence.


Sources

  • Nicholas Carlini. “Building a C compiler with Claude.” Anthropic Research Blog, February 2026.
  • Anthropic. “Claude Opus 4.6” and “Agent Teams in Claude Code.” anthropic.com, February 5, 2026.
  • Andrej Karpathy. “A few random notes from claude coding.” X/Twitter, January 2026.
  • Simon Willison. “No, AI is not Making Engineers 10x as Productive.” simonwillison.net, August 2025.
  • METR. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” metr.org, July 2025.

Victorino Group helps organizations build the governance layer that turns autonomous AI capability into production-grade engineering outcomes. If you’re evaluating multi-agent development or need help designing the environment that makes it work, reach out.

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation