The Infrastructure Between Agent Demos and Agent Operations
Nader Dabit published new data from inside Cognition this week. The number that matters: 70% of Devin sessions are still triggered by a human.
When we analyzed Dabit’s cloud agent thesis in February, the argument centered on four properties that make cloud agents a distinct category. Accessibility, cross-repository capability, asynchronous execution, organizational scale. Those properties are real. The operational reality is that seven out of ten times, a person still types the prompt that starts the work.
Dabit projects this will flip to 10% human-triggered, 90% automatic within a year. That projection is credible. It is also where the infrastructure conversation gets serious.
The Relay Problem
Dabit describes the current state bluntly: “That person is acting as a relay between two systems that could talk directly.” A monitoring system detects a failure. It pages a human. The human reads the alert, opens Devin, writes a prompt describing the problem, and waits for the agent to act. The human adds no intelligence to this chain. They translate machine-readable signals into natural language so another machine can process them.
“Once you see it that way, the prompt is the bottleneck.”
Removing that bottleneck requires infrastructure that most organizations have not built. The agent needs to receive signals directly from monitoring systems. It needs authority to act on those signals within defined boundaries. It needs to escalate when the situation exceeds its scope. And someone needs to audit what happened after the fact.
Dabit calls this scaffolding “the difference between an agent that opens a PR and an agent that closes an incident at 3am.” The word choice is precise. Scaffolding is temporary structure that enables construction. The permanent structure is governance.
Bessemer’s Five Frontiers
The same week, Bessemer Venture Partners published their AI infrastructure roadmap for 2026. Bessemer’s portfolio includes Anthropic and Cursor, so their perspective carries the weight of direct exposure to how these systems fail in production.
They identify five infrastructure frontiers: harness infrastructure (the orchestration layer around models), continual learning systems (agents that improve from experience), reinforcement learning platforms, inference optimization, and world models. The frontier that matters most for operations teams is the first one: harness infrastructure.
The harness is everything between the model and the production environment. Retrieval systems, tool orchestration, evaluation pipelines, guardrails. Bessemer’s analysis found 23 companies building RL platforms alone. The market is fragmented because the problems are unsolved.
One data point from their research deserves isolation: 78% of AI failures are invisible. A study published at arxiv.org (2603.15423) found that the vast majority of agent failures produce no error, no alert, no signal that anything went wrong. The agent completes its task. The output looks plausible. The mistake surfaces days or weeks later, buried in downstream consequences.
Worse: 93% of these failure patterns persist even with more powerful models. The failures do not stem from capability limits. They stem from interaction dynamics. Three patterns recur. The confidence trap, where agents produce authoritative-sounding wrong answers. Drift, where agent behavior shifts gradually as context accumulates. Silent mismatch, where the agent’s interpretation of a task diverges from the human’s intent without either side detecting the divergence.
More compute does not fix interaction failures. Governance does.
The Missing Primitives
Bessemer names the absence explicitly: “rollback mechanisms and governance primitives that don’t yet exist in standard ML workflows.”
This is an investment firm telling its portfolio companies and their customers that the governance infrastructure for autonomous agents has not been built yet. Full lineage tracking for continual learning. Isolation techniques for safe experimentation. Rollback capabilities when an agent’s learned behavior turns out to be wrong.
In The Week Agent Infrastructure Went Mainstream, we mapped the emerging production stack: identity, orchestration, observability, governance. Kubernetes introduced a new CRD for agent workloads. Grafana shipped MCP observability in two lines of code. Stripe revealed 1,300 agent-written PRs per week with governance encoded into every step.
Those are the building blocks. Bessemer’s roadmap reveals how much construction remains. The announcements we tracked were the first floor. Bessemer counts at least five more floors that need to exist before agents operate autonomously at scale.
From 70/30 to 10/90
The transition Dabit projects (from 70% human-triggered to 10%) is not a product feature. It is an infrastructure migration.
Consider what changes when an agent receives work directly from a monitoring system instead of from a human prompt. The human relay, for all its inefficiency, provides implicit governance. A person reads the alert, applies judgment about severity, decides whether the agent should handle it, and frames the task in a way that constrains the agent’s scope. Remove the human and every one of those functions needs an infrastructure replacement.
Severity classification becomes a routing policy. Scope constraint becomes a permission boundary. Task framing becomes a prompt template maintained by the operations team. Judgment about whether the agent should handle a given situation becomes an escalation threshold defined in configuration, not intuition.
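The paragraph above can be made concrete with a minimal sketch. Everything here is illustrative, not drawn from any specific product: the severity scale, the service names, and the thresholds are assumptions standing in for policies an operations team would define.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    severity: int       # hypothetical scale: 1 = critical ... 5 = informational
    description: str

# Permission boundary: services the agent is allowed to touch (illustrative).
AGENT_SCOPE = {"billing-worker", "report-generator"}

# Escalation threshold: anything at or above this severity goes to a human.
ESCALATE_AT_OR_ABOVE = 2

def route(alert: Alert) -> str:
    """Return 'agent', 'human', or 'ignore' for an incoming alert."""
    if alert.severity >= 5:
        return "ignore"                  # informational only
    if alert.severity <= ESCALATE_AT_OR_ABOVE:
        return "human"                   # too severe for autonomous handling
    if alert.service not in AGENT_SCOPE:
        return "human"                   # outside the permission boundary
    return "agent"                       # in scope: agent handles it directly

print(route(Alert("billing-worker", 3, "retry queue growing")))  # agent
```

The point of the sketch is that every branch used to live in a human's head. Encoding it in configuration is what makes removing the relay safe rather than reckless.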
As we argued in The Agent Operations Paradox, each agent you add multiplies output and multiplies operational load. The 70/30 to 10/90 transition multiplies both by an order of magnitude. An agent that runs while humans sleep needs infrastructure that governs while humans sleep.
What Jensen Huang Told You Without Saying It
At GTC 2026, Jensen Huang stated that inference compute now rivals training compute in demand. The industry spent years building infrastructure for training. Inference was an afterthought: run the model, get a response, done.
Agents broke that assumption. Agent inference is not a single request-response cycle. It is a multi-step execution that may run for minutes or hours, consume tools, accumulate state, and produce side effects in production systems. The compute profile looks nothing like a chatbot query. It looks like a long-running service.
Inference optimization for agents means optimizing sustained execution, not latency on a single call. It means managing context windows that grow over time. It means detecting when an agent has drifted into unproductive loops and terminating the session before it burns through compute budget.
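Loop detection of the kind described above can be sketched in a few lines. This is an assumed design, not how any particular agent platform implements it: "action" here is any hashable summary of a step (tool name plus arguments), and the window size and repeat limit are illustrative tuning knobs.

```python
from collections import deque

class LoopGuard:
    """Terminate a session when recent actions stop making progress (sketch)."""

    def __init__(self, window: int = 20, max_repeats: int = 3):
        self.recent = deque(maxlen=window)   # sliding window of recent actions
        self.max_repeats = max_repeats

    def record(self, action: str) -> bool:
        """Record a step; return True if the session should be terminated."""
        self.recent.append(action)
        return self.recent.count(action) > self.max_repeats

guard = LoopGuard()
for step in ["read_file:a.py", "run_tests", "run_tests", "run_tests", "run_tests"]:
    if guard.record(step):
        print("terminating: unproductive loop detected")
        break
```

A production version would compare semantic state rather than exact strings, but the economics are the same: the check is cheap, and the compute it saves is not.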
This is operations engineering. The model is a component. The infrastructure around it determines whether the component produces value or produces cost.
The Compound Problem
Each of these pieces interacts with the others in ways that make the total harder than the sum.
Invisible failures (78% of them) mean your observability layer needs to catch problems that produce no errors. Persistent failure patterns (93% surviving model upgrades) mean your governance layer cannot rely on “the next model will be better.” The human relay removal means your orchestration layer needs to encode judgment that was previously implicit. And inference at agent scale means your cost management needs to account for long-running autonomous workloads, not chatbot queries.
No single tool solves this. No single vendor covers it. Bessemer mapped 23 companies in RL platforms alone, and the harness infrastructure space is at least as fragmented. The organizations that will operate agents well are the ones building integrated governance across these layers, not shopping for point solutions.
What This Means for Your Infrastructure Roadmap
Three priorities emerge from this week’s data.
First, audit your human relay. Map every workflow where a person translates machine signals into agent prompts. Each one is a candidate for direct integration, but only after you build the routing policies, permission boundaries, and escalation thresholds that the human currently provides through judgment. Removing the human without replacing the governance creates an unmonitored autonomous system.
Second, instrument for invisible failures. If 78% of agent failures produce no error signal, your monitoring needs to detect anomalies in output quality, not just exceptions in execution. This means evaluation pipelines that sample agent work, comparison baselines that detect drift, and alerting on behavioral change rather than only on crashes.
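One way to sketch the baseline-and-drift idea: score sampled agent outputs (the scoring function itself is assumed to exist, e.g. an evaluation pipeline), keep a rolling baseline, and alert when a score deviates sharply. The window size and z-score threshold are placeholder values, not recommendations.

```python
import statistics

class DriftDetector:
    """Flag behavioral drift in sampled output-quality scores (sketch)."""

    def __init__(self, baseline_size: int = 50, z_threshold: float = 3.0):
        self.scores: list[float] = []
        self.baseline_size = baseline_size
        self.z_threshold = z_threshold

    def observe(self, score: float) -> bool:
        """Return True when a score deviates sharply from the rolling baseline."""
        drifted = False
        if len(self.scores) >= self.baseline_size:
            mean = statistics.fmean(self.scores)
            stdev = statistics.stdev(self.scores) or 1e-9  # avoid divide-by-zero
            drifted = abs(score - mean) / stdev > self.z_threshold
        self.scores.append(score)
        if len(self.scores) > self.baseline_size:
            self.scores.pop(0)                # keep the window bounded
        return drifted
```

Note what this does and does not catch: it fires on behavioral change, which is exactly the signal that exception-based monitoring misses when the agent completes its task and the output merely looks plausible.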
Third, plan for inference as operations. Agent workloads are long-running, stateful, and expensive. Budget for them the way you budget for always-on services, not the way you budget for API calls. Include cost controls, circuit breakers, and utilization monitoring from day one.
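A cost control for a long-running agent session can be as simple as a budget circuit breaker. The per-step cost model below is a stand-in for real token accounting, and the numbers are purely illustrative.

```python
class BudgetBreaker:
    """Trip a session's circuit before it exhausts its compute budget (sketch)."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record spend for one step; raise when the budget is exceeded."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(
                f"budget exceeded: ${self.spent_usd:.2f} of ${self.budget_usd:.2f}"
            )

breaker = BudgetBreaker(budget_usd=5.00)
try:
    while True:                      # stands in for the agent's step loop
        breaker.charge(0.75)         # hypothetical per-step inference cost
except RuntimeError as err:
    print(err)                       # session terminated by the cost control
```

The design choice worth noting is that the breaker raises inside the step loop rather than checking after the session ends: for workloads that run while humans sleep, the control has to act in-band.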
The distance between a demo agent and an operational agent is not capability. The models are capable enough. The distance is infrastructure: the governance primitives, observability layers, and orchestration systems that let an agent close an incident at 3am without creating a worse incident at 3:01.
This analysis synthesizes AI Infrastructure Roadmap: Five Frontiers for 2026 (March 2026) and Engineering for Agents That Never Sleep (March 2026).
Victorino Group helps enterprises close the distance between agent capability and agent operations. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.