- The Week Agent Infrastructure Went Mainstream
We have spent the past two months tracking companies shipping containment infrastructure for AI agents. In February, four companies independently built sandboxes. In March, five more converged on the same architecture: YAML policies, egress proxies, credential isolation.
The containment question is settled. The next question arrived this week.
Between March 13 and March 22, four announcements landed that share nothing at the surface level. DigitalOcean published how Cloudways built an AI SRE managing 90,000 servers. The Kubernetes project introduced a new CRD for agent workloads. Grafana shipped built-in observability for MCP servers. And a synthesis piece on “software factories” revealed that Stripe now merges 1,300 agent-written pull requests per week.
Read them separately and each is an infrastructure story. Read them together and a pattern emerges: the industry is no longer debating whether agents belong in production. It is building the operational stack agents require to stay there.
CW Copilot: The Agent That Runs 90,000 Servers
Cloudways, DigitalOcean’s managed hosting platform, built CW Copilot over the past year. The system combines Claude Sonnet 4 (via DigitalOcean’s serverless inference) with Ansible playbooks and a Celery/Redis task queue. It manages over 90,000 servers and 500,000 applications.
The architecture is worth studying for what it reveals about production agent design. CW Copilot does not have root access. It runs as a dedicated Linux user with restricted permissions, connecting to servers through sequential SSH sessions. Every action maps to a predefined Ansible playbook. The LLM reasons about what to do; Ansible controls what it can do.
This separation matters. The agent has broad diagnostic capabilities (it can inspect logs, processes, configurations across the fleet) but narrow action capabilities (it can only execute pre-approved remediation playbooks). Cloudways reports that “AI-powered insights are significantly faster and more consistent than those provided by a human agent.” No quantitative metrics accompany that claim, which is notable for a system in production for over a year.
The missing metrics point to a broader problem. As we argued in The Governance Loop Hidden in Your Agent Monitoring, when organizations frame agent systems as productivity tools rather than governance systems, they measure speed instead of correctness. CW Copilot’s qualitative-only reporting suggests it was built as an efficiency play. Whether it also functions as a governance system is a question its builders have not yet answered publicly.
What CW Copilot demonstrates clearly: production agents need identity (a dedicated user), bounded authority (Ansible playbooks as guardrails), and orchestration (task queues that sequence work). These are not optional features bolted on later. They are architectural decisions made from day one.
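The reason/act separation is simple enough to sketch. Below is a minimal illustration of the pattern in Python; the playbook names, proposal format, and user name are hypothetical stand-ins, not CW Copilot's actual interface:

```python
# Sketch of the "LLM reasons, playbooks constrain" pattern: the model can
# propose anything, but only allowlisted playbooks ever execute.
# All names here are illustrative, not CW Copilot's real implementation.

APPROVED_PLAYBOOKS = {
    "restart_service": "playbooks/restart_service.yml",
    "clear_disk_cache": "playbooks/clear_disk_cache.yml",
    "rotate_logs": "playbooks/rotate_logs.yml",
}

def execute_remediation(proposal: dict) -> str:
    """Run an LLM-proposed action only if it maps to a pre-approved playbook."""
    action = proposal.get("action")
    if action not in APPROVED_PLAYBOOKS:
        # Anything outside the allowlist escalates instead of executing.
        return f"escalated: '{action}' is not a pre-approved playbook"
    # In production this would invoke ansible-playbook as a restricted user;
    # here we only report what would run.
    return f"would run {APPROVED_PLAYBOOKS[action]} as user 'cw-copilot'"

print(execute_remediation({"action": "restart_service"}))
print(execute_remediation({"action": "rm -rf /"}))
```

The design point is that the allowlist, not the model, is the security boundary: a prompt-injected or confused agent can at worst run an approved remediation at the wrong time, never an arbitrary command.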
K8s Agent Sandbox: Kubernetes Admits Containers Are Not Enough
The Kubernetes project’s new Agent Sandbox, announced March 20 under SIG Apps by Janet Kuo and Justin Santa Barbara, is the most architecturally revealing announcement of the four.
Kubernetes already runs AI inference workloads. Stateless inference is a solved problem: a request arrives, a model processes it, a response returns. Roughly 50 milliseconds, no state, no identity. Traditional pods handle this well.
Agents break every assumption in that model.
An agent runs for minutes or hours, not milliseconds. It accumulates state (conversation history, tool outputs, partial work). It needs a persistent identity so other systems can address it. It requires isolation stronger than what a standard pod provides, because an agent with tool access can execute arbitrary code. And it needs to suspend and resume without losing context, because keeping an idle agent running wastes expensive GPU-adjacent resources.
The K8s Agent Sandbox introduces three new primitives. The Sandbox CRD defines a singleton, stateful, isolated workload with its own identity. SandboxWarmPool pre-provisions pods so agents get instant allocation without cold starts. SandboxClaim lets workloads request a sandbox from the pool, similar to how PersistentVolumeClaim requests storage.
The isolation model uses gVisor or Kata Containers at the kernel level, plus network isolation policies. This goes beyond what we documented in the containment taxonomy: Kubernetes is not just adding sandboxing to an existing orchestrator. It is building a new resource type that treats agents as fundamentally different from services.
The lifecycle management is where the design gets interesting. Agents can suspend with full state preservation and resume later. They can scale to zero when idle. The warm pool eliminates the startup penalty that makes scale-to-zero impractical for latency-sensitive workloads.
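The warm-pool mechanics can be shown with a toy model. This is a plain-Python analogy for how SandboxWarmPool and SandboxClaim interact, not the actual controller logic or CRD semantics:

```python
# Toy model of warm-pool allocation: claims against a pre-provisioned pool
# are instant; only an exhausted pool pays the cold-start cost.
import itertools
from collections import deque

class WarmPool:
    """Pre-provisions sandboxes so claims are satisfied without a cold start."""

    def __init__(self, size: int):
        self._ids = itertools.count(1)
        self._ready = deque(self._provision() for _ in range(size))

    def _provision(self) -> str:
        # Stands in for the slow path: pulling images, booting gVisor/Kata, etc.
        return f"sandbox-{next(self._ids)}"

    def claim(self) -> tuple[str, bool]:
        """Return (sandbox_id, was_warm). Falls back to a cold start if empty."""
        if self._ready:
            return self._ready.popleft(), True
        return self._provision(), False

pool = WarmPool(size=2)
print(pool.claim())  # ('sandbox-1', True)  -- warm allocation
print(pool.claim())  # ('sandbox-2', True)  -- warm allocation
print(pool.claim())  # ('sandbox-3', False) -- pool exhausted: cold start
```

In the real design a controller would refill the pool in the background, which is what makes scale-to-zero compatible with low-latency allocation.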
A Python SDK (`pip install k8s-agent-sandbox`) provides the developer interface. The project is early (development stage, not production), but the signal is unmistakable: the Kubernetes community has decided that agents are a distinct workload class that requires distinct infrastructure.
Grafana MCP Observability: Two Lines of Code That Change Everything
Grafana’s March 20 announcement is the smallest in scope and possibly the largest in consequence.
Using OpenLIT integration, any MCP server can add production observability with two lines of code: `import openlit` and `openlit.init()`. That is the entire instrumentation. From those two lines, you get p95 and p99 latency tracking, tool invocation duration, context window usage monitoring, and error classification. Pre-built dashboards visualize tool performance, protocol health, and failure patterns. Both MCP client and server sides are instrumented.
The implementation uses OpenTelemetry, which makes it vendor-neutral. You can ship traces to Grafana, Datadog, or any OpenTelemetry-compatible backend.
Why does this matter? Because until now, MCP monitoring was ad hoc. Teams built custom logging, wrote their own dashboards, or simply did not monitor their MCP servers at all. As we explored in The Governance Loop, observability and governance are the same system viewed from different angles. You cannot govern what you cannot measure. You cannot measure what you have not instrumented.
Grafana’s move makes MCP the first major agent protocol with built-in, standardized observability. Not as a third-party add-on. Not as an enterprise feature. As a two-line integration that any developer can add in five minutes.
The pre-built dashboards are the quiet part. Dashboards encode assumptions about what matters. Grafana’s dashboards track tool invocation patterns, latency distributions, and context window consumption. These are not operational metrics. They are governance metrics. How often does tool X get called? How much context does it consume? When does it fail? These questions tell you whether your agent system is behaving within expected boundaries, which is a governance question wearing an observability label.
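To make that concrete, here is a minimal sketch of how per-tool governance metrics fall out of invocation records. The records and the nearest-rank percentile method are illustrative assumptions, not OpenLIT's implementation:

```python
# Sketch of the kind of per-tool metrics such dashboards surface, computed
# here from hypothetical invocation records rather than real OpenTelemetry
# traces.
import math

invocations = [
    {"tool": "search_docs", "duration_ms": d, "error": e}
    for d, e in [(42, False), (55, False), (61, False), (48, False), (900, True)]
]

def percentile(sorted_vals: list[int], p: float) -> int:
    # Nearest-rank method: smallest value with at least p% of samples at or below it.
    k = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[k - 1]

def tool_metrics(records: list[dict]) -> dict:
    durations = sorted(r["duration_ms"] for r in records)
    return {
        "calls": len(records),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "error_rate": sum(r["error"] for r in records) / len(records),
    }

print(tool_metrics(invocations))
# {'calls': 5, 'p95_ms': 900, 'p99_ms': 900, 'error_rate': 0.2}
```

One slow failing call dominates the tail: p95 jumps to 900 ms even though the median sits near 50 ms. That tail, not the average, is the governance signal.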
The Software Factory at Scale: 1,300 PRs Per Week
The fourth data point comes from a synthesis by Alex Opoien on what he calls “the software factory.” The headline number is Stripe’s. Their agent system, built on a curated MCP toolset called Toolshed (~500 tools selected per task context), now merges over 1,300 pull requests per week. All agent-written. Devbox environments spin up in 10 seconds. Business requests that took 10 to 14 days reach production in hours.
The governance architecture underneath those numbers is the story. Stripe uses what they call the Blueprint pattern: hybrid workflows where deterministic steps handle what can be specified precisely and agentic subtasks handle what cannot. A hard rule caps iteration: maximum two CI rounds before human review. If the agent cannot get the build green in two attempts, a human intervenes.
This is backpressure engineering. The system does not trust agents unconditionally. It gives them bounded autonomy with escalation triggers. As we analyzed in The Most Governed Software Factory, the organizations removing humans from code are not removing governance. They are encoding it into the system so deeply that manual checkpoints become redundant.
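The escalation rule reduces to a bounded loop. A minimal sketch, with hypothetical agent and CI interfaces standing in for Stripe's actual system:

```python
# Sketch of the two-CI-round backpressure rule: bounded agent iteration
# with a hard escalation trigger. Interfaces are illustrative stand-ins.
from typing import Callable

MAX_CI_ROUNDS = 2

def run_with_backpressure(
    propose_fix: Callable[[int], str],
    ci_passes: Callable[[str], bool],
) -> str:
    """Let the agent iterate, but escalate to a human after MAX_CI_ROUNDS failures."""
    for attempt in range(1, MAX_CI_ROUNDS + 1):
        patch = propose_fix(attempt)
        if ci_passes(patch):
            return f"merged on attempt {attempt}"
    return "escalated to human review"

# Toy run: CI fails on the first patch, passes on the second.
result = run_with_backpressure(
    propose_fix=lambda n: f"patch-v{n}",
    ci_passes=lambda patch: patch == "patch-v2",
)
print(result)  # merged on attempt 2
```

The cap converts an open-ended retry loop into a predictable cost ceiling: the worst case is two CI runs plus one human review, never an agent burning compute indefinitely on a red build.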
Boris Cherny, who works on Claude Code, offers a prediction: “We’re going to start to see the title of software engineer go away.” The claim is provocative but the data beneath it is concrete. Roughly 60% of non-engineers at companies adopting these tools now use AI for code daily. A 5-person team in 2026 can ship what a 50-person team shipped in 2016.
Whether or not the job title changes, the operational model already has. Software production at this scale requires the same infrastructure any production system requires: identity, orchestration, observability, and governance.
The Agent Infrastructure Stack
Read these four announcements together and a stack emerges:
| Layer | Function | This Week’s Evidence |
|---|---|---|
| Identity | Persistent, addressable agent with bounded permissions | CW Copilot (dedicated Linux user), K8s Sandbox (Sandbox CRD) |
| Orchestration | Task sequencing, warm pools, lifecycle management | CW Copilot (Celery/Redis queue), K8s Sandbox (SandboxWarmPool, suspend/resume) |
| Observability | Latency, tool usage, context consumption, error tracking | Grafana MCP (OpenTelemetry, pre-built dashboards) |
| Governance | Policy enforcement, escalation triggers, audit trails | Stripe Blueprint (2-CI-round cap), CW Copilot (Ansible playbook boundaries) |
This is not a framework we invented. It is a framework the industry built this week without coordinating on it.
Each layer addresses a failure mode that the containment stack alone does not cover. Containment answers “what can the agent access?” Identity answers “who is this agent and what is it allowed to do?” Orchestration answers “how do agents get work and how do they hand it off?” Observability answers “is the agent behaving within expected parameters?” Governance answers “what happens when it doesn’t?”
The containment work we tracked in February and March was the foundation. This week’s announcements are the floors being built on top of it.
What Changes for Enterprises
Three practical implications follow.
First, agents need their own infrastructure primitives. Kubernetes acknowledging this explicitly (agents are not services, they need a new CRD) validates what production teams have been discovering through trial and error. If you are running agents as regular containers or serverless functions, you are fighting the abstraction. Agent workloads have different lifecycle, state, and isolation requirements. Build for those requirements or accumulate operational debt.
Second, observability is no longer optional and no longer difficult. Grafana reduced MCP monitoring from “build a custom solution” to “add two lines of code.” The barrier to instrumentation has dropped to near zero. Organizations still running agents without production telemetry are choosing blindness, not managing complexity.
Third, the factory model works but only with governance built in. Stripe’s 1,300 PRs per week is not a story about removing humans from software. It is a story about encoding governance so precisely (Blueprint patterns, CI round caps, curated tool contexts) that agents can operate at scale within defined boundaries. The governance is the product. The velocity is a side effect.
The question is no longer whether your organization will run agents in production. It is whether your infrastructure treats them as first-class operational citizens, with identity, orchestration, observability, and governance. Or as scripts with API keys.
This analysis synthesizes CW Copilot: AI SRE for 90,000+ Servers (March 2026), K8s Agent Sandbox (March 2026), Grafana MCP Observability via OpenLIT (March 2026), and The Software Factory (March 2026).
Victorino Group helps organizations build the operational infrastructure that production agents require. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com. About The Thinking Wire →