Operating AI

From In-the-Loop to On-the-Loop: What Companies Running AI Agents Actually Do Differently

Thiago Victorino

CircleCI’s 2026 State of Software Delivery report analyzed 28 million workflows. The top finding should bother anyone selling the “AI makes everyone faster” narrative: median main branch throughput dropped 7% year over year. The industry-wide success rate hit 70.8%, a five-year low. Mean recovery time climbed to 72 minutes, up 13%.

The top 5% of teams doubled their throughput. The other 95% did not.

Fewer than 1 in 20 teams managed what CircleCI calls balanced growth: writing more code and shipping it successfully. The rest either wrote more and shipped less, wrote less and shipped more, or regressed on both dimensions. A necessary qualifier: CircleCI measures “throughput” as workflow runs, not features delivered. More workflow runs can mean more retries, more CI churn, more agents spinning against broken builds. The top 5% may be shipping more features, or they may be running more automated loops. The report does not distinguish.

Still, the signal matters. Something separates the 5% from the rest. It is not access to better models. Everyone has access to the same models. The difference is what sits around the models.

Two Loops, Not One

Martin Fowler’s team at Thoughtworks published a framework in late 2025 that names the distinction cleanly. Dave Morris, writing on Fowler’s blog, describes two operating modes for humans working with AI agents.

In the loop: the human reviews AI output before it ships. Every pull request gets a pair of eyes. The developer reads the diff, checks the logic, approves the merge. This is how most teams use Copilot and Cursor today. The AI generates. The human gates.

On the loop: the human engineers the system that produces and validates AI output. The developer does not read every diff. Instead, they build the test suites, CI pipelines, linting rules, and quality gates that catch problems automatically. The human watches the system’s health metrics, not individual outputs.

A note on terminology. “On the loop” is not Fowler’s invention. The phrase comes from the U.S. Department of Defense Directive 3000.09 (2012), which defined three modes for autonomous weapons: human in the loop, human on the loop, and human out of the loop. Morris adapted the framework for software engineering. The intellectual lineage matters because the military framing carries a useful insight: the progression from in-the-loop to on-the-loop is not casual. It requires engineering the oversight mechanism before you relax the human’s direct involvement.

The 5% in CircleCI’s data are on-the-loop teams. The 95% are still in the loop.

The Why Loop and the How Loop

The distinction maps onto something more fundamental. Engineering work has always contained two loops.

The Why Loop is about intent: What should we build? Why does it matter? Is it working? These are decisions about direction, priority, and outcome. Humans own this loop because it requires judgment about context that exists outside the codebase.

The How Loop is about execution: Write the code. Run the tests. Fix the linting errors. Deploy to staging. These are deterministic sequences with clear success criteria. Agents are increasingly competent at this loop because the feedback signals are legible: tests pass or fail, the linter accepts or rejects, CI goes green or red.

The 5% teams are shifting human time from the How Loop to the Why Loop. Not by abandoning quality standards. By encoding those standards into automated systems that agents can execute against.

As we examined in The 65% Rule, production AI systems converge toward mostly-deterministic architectures. That convergence is the How Loop being automated. The deterministic nodes in Stripe’s Blueprint Engine, the CI pipelines in CircleCI’s top performers, the test suites in Carlini’s compiler project. All of them are How Loop infrastructure that lets agents operate without human review of every artifact.

What the On-the-Loop Companies Actually Look Like

Three companies published enough operational detail in early 2026 to examine their approaches concretely.

Ramp: Measuring Organizational Capability

Ramp, the corporate card company, hit $1 billion in annualized revenue with roughly 311 engineers. Their CPO, Geoff Charles, claims that 25 product managers oversee the 500+ features on their 2026 roadmap. I have not verified the feature count independently, and a growth-stage company’s definition of “feature” may differ from yours. But the structural claim is testable: Ramp publishes their AI proficiency levels.

They use a four-tier model. L0: basic autocomplete usage. L1: interactive AI pair programming. L2: delegating full tasks to agents. L3: multi-agent workflows operating with minimal supervision.

The interesting part is not the taxonomy. It is that Ramp measures this at the organizational level. They track which teams are at which level, and they staff product managers based on the assumption that engineering velocity is increasing. The lean PM ratio may reflect growth-stage culture as much as AI maturity. But the measurement system itself is the tell. They are not asking “are our developers using AI.” They are asking “what level of autonomy can our codebase support.”
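Measuring at the organizational level is mechanically simple. A minimal sketch of what that tracking could look like, with the level names encoded from Ramp's published tiers but the team names and assignments entirely invented:

```python
from collections import Counter
from enum import IntEnum

class Proficiency(IntEnum):
    """Encoding of Ramp's four published tiers (names are my shorthand)."""
    L0_AUTOCOMPLETE = 0   # basic autocomplete usage
    L1_PAIRING = 1        # interactive AI pair programming
    L2_DELEGATION = 2     # delegating full tasks to agents
    L3_MULTI_AGENT = 3    # multi-agent workflows, minimal supervision

def level_distribution(team_levels: dict[str, Proficiency]) -> Counter:
    """Aggregate per-team levels into an org-wide distribution."""
    return Counter(level.name for level in team_levels.values())

# Illustrative data only: these teams and levels are invented.
teams = {
    "payments": Proficiency.L2_DELEGATION,
    "risk": Proficiency.L1_PAIRING,
    "platform": Proficiency.L3_MULTI_AGENT,
}
```

The point of the exercise is the question it forces: not "which developers use AI" but "where is each part of the organization on the autonomy curve."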

Factory: Readiness as a Property of Codebases

Factory, an AI agent company valued at $300 million with 55 employees, published something counterintuitive in early 2026. Their five-level maturity model for agent readiness evaluates the codebase, not the engineers.

The model has 60+ criteria across eight pillars. Test coverage, CI pipeline reliability, documentation quality, type safety, dependency management. The thesis: agents perform well in codebases that are well-governed, and poorly in codebases that are messy. The limiting factor for agent productivity is not the model’s capability. It is the infrastructure the model operates within.

Factory reports that evaluation variance across their criteria dropped from 7% to 0.6% after grounding assessments in specific, observable properties rather than subjective ratings. That precision matters because it makes the model actionable. You can improve your type coverage. You cannot improve your “AI readiness” in the abstract.
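The shift from subjective ratings to observable properties is the part worth copying. A sketch of what criterion-based scoring could look like; the pillar names and checks below are illustrative stand-ins, not Factory's actual 60+ criteria:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    """One binary, observable check on a codebase."""
    pillar: str
    name: str
    passed: bool

def pillar_scores(criteria: list[Criterion]) -> dict[str, float]:
    """Fraction of passing checks per pillar. Observable, not a gut rating."""
    grouped: dict[str, list[int]] = {}
    for c in criteria:
        grouped.setdefault(c.pillar, []).append(int(c.passed))
    return {pillar: sum(hits) / len(hits) for pillar, hits in grouped.items()}

# Invented example assessment.
assessment = [
    Criterion("testing", "line coverage >= 80%", True),
    Criterion("testing", "CI flake rate < 1%", False),
    Criterion("types", "strict type checking enabled", True),
]
```

Because every criterion is a yes/no fact about the repository, two assessors looking at the same codebase converge on the same score, which is presumably how Factory got variance down from 7% to 0.6%.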

A caveat: Factory sells AI agent services. Their maturity model is also a sales qualification framework. Companies that score low on Factory’s assessment need Factory’s help to improve. This does not invalidate the framework, but it should calibrate how you weight it. Independent evidence supports the direction: the teams in Agent Teams and the Shift from Writing Code to Directing Work found the same pattern. Environment quality determined agent output quality more than anything else.

McKinsey’s Projections (Not Measurements)

McKinsey published two relevant pieces in early 2026. One on banking operations, projecting that employees will shift from 80% coordination tasks to 80% decision-making tasks as agents handle routine execution. Another on agentic commerce, projecting $3-5 trillion in value by 2030 and describing a six-level automation curve.

These are projections, not measurements. McKinsey is explicit about this. The banking piece describes a pilot where one person manages 20-30 agents, but the sample is small and the timeline is aspirational.

I include them not as evidence but as directional signal from a firm with extensive access to large enterprise operational data. The direction matches what the measured data shows: humans moving from execution to oversight, agents handling deterministic sequences, organizational value shifting toward the systems that govern agent behavior.

The Validation Bottleneck

Here is the real problem that the CircleCI data exposes. AI made code generation cheap. It did not make validation cheap.

Writing a pull request takes an agent minutes. Reviewing that pull request for correctness, security implications, architectural fit, and business logic accuracy takes a human the same time it always did. If you double the volume of code generated without doubling your validation capacity, you create a bottleneck. The 95% of teams whose throughput did not improve may be experiencing exactly this.

The success rate decline to 70.8% supports the hypothesis. More code is being written and submitted. Less of it passes on the first try. Recovery times are climbing because the failures are distributed across more concurrent work streams.
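The arithmetic is worth making concrete. A toy backlog model (the rates here are invented for illustration, not drawn from CircleCI's data) shows what happens when generation doubles and validation capacity does not:

```python
def review_backlog(days: int, prs_per_day: int, review_capacity: int) -> int:
    """Unreviewed PRs accumulate whenever generation outpaces validation."""
    backlog = 0
    for _ in range(days):
        backlog = max(0, backlog + prs_per_day - review_capacity)
    return backlog

# Before agents: 10 PRs/day against 10 reviews/day -> backlog stays at zero.
# Agents double generation, validation unchanged: 20 PRs/day against 10
# reviews/day leaves 100 PRs queued after ten working days.
```

The backlog grows linearly and never clears on its own. That is the bottleneck in one function.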

The on-the-loop response is not “hire more reviewers.” It is “automate the validation.” Build the test suites that catch the problems before a human needs to look. Improve CI pipeline reliability so that failures are deterministic and informative. Invest in type systems, linting rules, and architectural constraints that prevent entire categories of error at the system level.

This is expensive upfront. It is invisible in quarterly productivity metrics. And it is the only approach that scales.

The Agentic Flywheel

The most interesting pattern across all three companies is a feedback loop that does not get discussed enough.

In-the-loop teams have a linear workflow: human defines task, agent executes, human reviews. The learning from each cycle lives in the human’s head. If the agent produced a bad pattern, the human corrects it manually next time. Knowledge accumulates in people, not systems.

On-the-loop teams have a circular workflow. Agents execute against automated checks. When agents fail, the failures are logged and categorized. The categories inform improvements to the automated checks. Better checks produce better agent output. Better output increases trust. Increased trust permits more autonomy. More autonomy frees humans to improve the checks further.

This is a flywheel. Each revolution improves the system’s capacity to produce correct output without human intervention. The human’s role is maintaining and improving the flywheel, not operating it manually.
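The "log, categorize, promote to a check" step is the crank that turns the flywheel. A sketch under assumed log structure (the categories and threshold are invented):

```python
from collections import Counter

def gate_candidates(failure_log: list[dict], threshold: int = 3) -> list[str]:
    """Failure categories recurring often enough to deserve a new automated check."""
    counts = Counter(entry["category"] for entry in failure_log)
    return [cat for cat, n in counts.items() if n >= threshold]

# Invented failure log from agent runs.
log = [
    {"task": "refactor-auth", "category": "missing-null-check"},
    {"task": "add-webhook", "category": "missing-null-check"},
    {"task": "fix-retry", "category": "missing-null-check"},
    {"task": "add-webhook", "category": "flaky-test"},
]
# "missing-null-check" recurs three times: promote it to a lint rule or test,
# and that whole category of failure stops reaching a human.
```

Each promoted category converts a recurring manual correction into a permanent automated one, which is exactly the knowledge-in-systems-not-people property the linear workflow lacks.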

Ramp measures this with their L0-L3 scale. Factory evaluates the codebase properties that enable it. CircleCI’s data shows the throughput difference between teams that have it and teams that do not.

What This Means for Your Organization

Agent readiness is an infrastructure problem, not a training problem. Sending engineers to AI workshops does not help if your test coverage is 30% and your CI pipeline takes 45 minutes. Start with the codebase. Factory’s insight is correct even if their commercial interest is visible: the infrastructure determines the ceiling.

Measure the system, not the individual. Tracking “AI adoption” per developer is in-the-loop thinking. Track your CI success rate, mean time to recovery, test coverage, and the ratio of automated to manual validation steps. These are the metrics that predict whether agents can operate with reduced supervision.

Budget for the validation layer. Every dollar spent on code generation without a matching investment in code validation is a dollar spent creating a bottleneck. CircleCI’s data is a 28-million-workflow proof that generation without validation creates churn, not velocity.

Progressive autonomy, not binary automation. No company in this analysis went from “humans write everything” to “agents write everything” in one step. Ramp has four levels. Factory has five. McKinsey describes six. The progression is deliberate, measured, and reversible. Each step requires proving that the automated checks are reliable before expanding agent autonomy.

The survivorship problem is real. Every company cited here is an elite technology organization. Ramp, Factory, and Stripe have engineering cultures, tooling budgets, and talent access that most organizations lack. The principles transfer. The timelines probably do not. If you are running a 50-person engineering team with legacy infrastructure, the path to on-the-loop operations is longer than these examples suggest. That is not a reason to avoid starting. It is a reason to be honest about what “starting” looks like for your context.

The transition from in-the-loop to on-the-loop is not a philosophy change. It is an engineering project. You build the test suites, the CI pipelines, the quality gates, and the monitoring systems that allow agents to operate with progressively less human review. The companies pulling ahead are not the ones with the best models or the most enthusiastic developers. They are the ones who treated the harness as a first-class product and invested accordingly.

Fifty-seven percent of companies now have AI agents in production, according to G2’s August 2025 data. That number will keep climbing. The question that determines outcomes is not whether you deploy agents. It is whether you build the systems that make agent output trustworthy without a human reading every line.


This analysis synthesizes CircleCI 2026 State of Software Delivery (January 2026), Thoughtworks / Martin Fowler’s blog on in-the-loop vs on-the-loop (November 2025), Ramp CPO interview on AI proficiency levels (February 2026), Factory’s agent-readiness maturity model (January 2026), McKinsey on agentic AI in banking (February 2026), and G2 AI agent adoption survey (August 2025).

Victorino Group helps engineering organizations build the validation infrastructure that turns AI agent deployments into measurable operational gains. Let’s talk.
