Operating AI

The Governance Loop Hidden in Your Agent Monitoring

Thiago Victorino

Harrison Chase published a long post in February about agent observability. The LangChain CEO’s thesis: you cannot predict what an agent will do until it runs in production, and the tools for watching it there need to be fundamentally different from traditional APM.

He is right about the first part. He is wrong about why.

Chase frames the problem as “infinite input space.” Agents face an unbounded range of inputs, unlike traditional software with defined endpoints. This sounds compelling until you realize search engines have handled infinite input spaces for two decades. Google processes 8.5 billion queries per day. The queries are infinitely variable. The infrastructure for monitoring them is well understood.

The actual insight is subtler, and Chase buries it in the middle of his post. Multi-step tool-calling chains compound non-determinism. Each step introduces variance. An LLM choosing tool A over tool B at step two changes the entire downstream execution. Monday.com confirmed this directly: “a minor deviation in a prompt or tool-call result can cascade into a significantly different outcome.” The challenge is not infinite inputs. It is compounding uncertainty across sequential decisions.
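The arithmetic behind compounding uncertainty is blunt. A minimal sketch, assuming a hypothetical 95% per-step reliability and independent steps (both numbers invented for illustration):

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential tool-calling chain
    behaves as intended, assuming steps are independent."""
    return per_step ** steps

# A single call that behaves as intended 95% of the time looks dependable.
# Chain ten of them and the end-to-end figure drops below 60%.
for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps -> {chain_reliability(0.95, steps):.1%}")
```

The per-step figure is invented; the shape of the curve is the point. Reliability decays geometrically with chain length, which is why a monitoring approach built for single request-response calls undercounts agent risk.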

That distinction matters because it points toward a different solution than the one Chase sells.

The Production Improvement Loop

Chase proposes a cycle: collect traces, route interesting ones to annotation queues, build datasets from annotations, run experiments against those datasets, deploy online evaluations, repeat. He calls it the “production improvement loop.”

It works. The pattern is genuinely useful, and any organization running agents at scale should implement something like it. The loop creates a feedback mechanism between production behavior and system improvement. Without it, you are flying blind.
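The first three stages of that loop can be sketched in a few lines. Everything below is a hypothetical illustration, not LangSmith's API; names like `AnnotationQueue` and `build_dataset` are invented for clarity:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    output: str
    flagged: bool = False  # e.g. an online evaluation marked it "interesting"

@dataclass
class AnnotationQueue:
    pending: list = field(default_factory=list)

    def route(self, trace: Trace) -> None:
        # Stage 2: only flagged traces reach human reviewers.
        if trace.flagged:
            self.pending.append(trace)

def build_dataset(annotated: list) -> list:
    # Stage 3: human labels become regression examples for offline experiments.
    return [{"input": t.trace_id, "expected": label} for t, label in annotated]

# Stages 1-3: collect traces, route the interesting ones, build a dataset.
queue = AnnotationQueue()
for t in [Trace("t1", "ok"), Trace("t2", "refund issued twice", flagged=True)]:
    queue.route(t)

dataset = build_dataset([(t, "fail") for t in queue.pending])
print(dataset)  # one labeled example, ready for experiments
```

Stages 4-6 (experiments, deployment, online evaluation) close the cycle back into production. Notice that nothing in the sketch is technically exotic; the hard part is deciding who flags, who labels, and who approves.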

But notice what Chase actually described. Annotation queues are compliance review by another name. Evaluation rubrics are control policies. Cross-functional teams (product managers, domain experts, compliance officers reviewing agent outputs) are a governance structure. Online evaluations are runtime monitoring. Sampling strategies are risk-based oversight.

Chase built a governance framework. He just never uses the word.

Why the Label Matters

This is not a semantic game. The difference between “observability tooling” and “governance framework” changes who owns it, how it scales, and what it costs.

When you call it observability, engineering owns it. The SRE team picks a vendor, instruments the traces, builds dashboards. Product managers look at the dashboards occasionally. Compliance never sees them.

When you call it governance, ownership expands. The annotation queue becomes a compliance function. The evaluation rubric becomes a policy document. The cross-functional review team becomes a standing committee with defined authority. The sampling strategy becomes a risk assessment.

As we explored in The Operations Gap, Anthropic’s study of millions of agent sessions showed experienced users naturally shift from per-action approval to active monitoring with strategic intervention. That organic trust calibration works for individual users. It does not work for organizations. Organizations need the explicit structure that governance provides. Chase’s loop is that structure, misidentified.

The Compounding Problem

Academic research supports the core claim about non-determinism. A November 2024 paper (arxiv:2411.15594) found that LLM-as-judge evaluation achieves roughly 85% alignment with human judgment, exceeding the 81% agreement rate between humans. Binary pass/fail judgments prove more reliable than numeric scores.

This means automated evaluation can work. But it also means the evaluation layer needs the same governance discipline as any other quality system. Who defines the rubric? Who audits the judge? What happens when the LLM evaluating the LLM is wrong in the same way the original LLM was wrong? These are governance questions, not engineering questions.
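One of those questions, auditing the judge, reduces to a measurable control: compare the judge's binary verdicts against a human-reviewed sample and track agreement over time. A minimal sketch, with invented labels standing in for real review data:

```python
def agreement_rate(judge: list, human: list) -> float:
    """Fraction of sampled traces where the LLM judge and the human
    reviewer gave the same binary pass/fail verdict."""
    if len(judge) != len(human):
        raise ValueError("paired samples required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)

# Made-up verdicts for eight sampled traces (True = pass).
judge_labels = [True, True, False, True, False, True, True, False]
human_labels = [True, True, False, False, False, True, True, True]

rate = agreement_rate(judge_labels, human_labels)
print(f"judge-human agreement: {rate:.0%}")  # here: 75%
```

A governance framework would set a floor on this number, alarm when it drifts, and route the disagreements themselves into the annotation queue, since those are exactly the cases where the judge and the rubric diverge.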

Pre-production testing cannot answer them. MMLU scores above 80% tell you nothing about how an agent behaves when a customer sends a request in broken English at 2 AM and the primary API is returning 500 errors. Goodeye Labs documented this directly in 2025: benchmark performance and production performance are essentially uncorrelated for complex agent workflows.

What Chase Gets Wrong

Three things deserve challenge.

First, the claim that general-purpose tools “fall short” for agent monitoring. This was arguably true in 2024. It is not true in 2026. Datadog launched AI Agent Monitoring in June 2025 with decision graph visualization and tool invocation tracing. OpenTelemetry’s GenAI semantic conventions are maturing rapidly. Langfuse offers an MIT-licensed self-hosted alternative to LangSmith. The infrastructure is catching up. The question is no longer “do tools exist?” but “do you know how to use them?”

Second, the omission of cost. Chase mentions sampling strategies (10-20% of traces) without citing a source for those numbers and without addressing what comprehensive tracing actually costs. At scale, trace storage and LLM-as-judge evaluation become significant line items. An organization processing 10,000 daily agent interactions could face $75K per month in trace-related costs. As we discussed in The Operations Tax, the overhead of governance infrastructure is real. Pretending it does not exist serves vendors, not practitioners.
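A back-of-envelope model makes the line item concrete. Every unit price below is an assumption chosen for illustration, not vendor pricing; with these particular numbers the total lands in the same ballpark as the figure above, but the point is that the arithmetic should be done explicitly, not that these inputs are right:

```python
# All constants are illustrative assumptions -- substitute your own
# vendor pricing, trace sizes, and sampling policy before trusting output.
DAILY_INTERACTIONS  = 10_000
STEPS_PER_RUN       = 15      # assumed avg tool calls per agent run
STORAGE_PER_SPAN    = 0.015   # assumed $/span ingested and retained
JUDGE_SAMPLE_RATE   = 0.15    # middle of the 10-20% range from the post
JUDGE_COST_PER_EVAL = 0.05    # assumed $/LLM-as-judge call

def monthly_cost(days: int = 30) -> float:
    spans_per_day = DAILY_INTERACTIONS * STEPS_PER_RUN
    storage = spans_per_day * STORAGE_PER_SPAN * days
    judging = DAILY_INTERACTIONS * JUDGE_SAMPLE_RATE * JUDGE_COST_PER_EVAL * days
    return storage + judging

print(f"${monthly_cost():,.0f}/month")  # with these assumptions: $69,750
```

Whatever the real inputs are, storage scales with steps per run, not interactions, which is why multi-step agents surprise people on the invoice.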

Third, data sovereignty. Chase’s entire framework assumes you can send every prompt and response to a third-party SaaS platform for analysis. For regulated industries (healthcare, finance, legal), this assumption fails immediately. The governance framing makes this obvious: you would never send your compliance data to an uncontrolled external system. But the observability framing hides it behind “just another analytics tool.”

The Convergence

Here is what makes this moment interesting. Observability teams and governance teams are building toward the same thing from opposite directions.

Observability teams started with technical monitoring (latency, errors, throughput) and are adding semantic evaluation (was the output correct? was it safe? did it follow policy?). Governance teams started with policy frameworks (acceptable use, risk classification, approval workflows) and are adding technical enforcement (runtime monitoring, automated evaluation, trace analysis).

They will meet in the middle. The organizations that recognize this convergence early will build one system instead of two.

Factory’s Signals framework, which we covered in Your Agent Already Knows What’s Wrong, approaches this from the practitioner side. Their system identifies behavioral friction patterns in agent sessions and auto-resolves 73% of issues without manual intervention. That self-healing loop is observability and governance fused into a single operating discipline.

Chase’s production improvement loop is the same fusion, viewed from the platform vendor side.

What This Actually Looks Like

A converged observability-governance system has six properties:

Defined authority. Someone owns the evaluation rubric. Changes require approval. This is not an engineering decision. It is a policy decision.

Risk-based sampling. Not 100% of traces, not a flat percentage. High-risk interactions (financial transactions, medical advice, legal guidance) get full review. Low-risk interactions get statistical sampling. The sampling strategy follows the risk classification, the same way audit sampling follows materiality thresholds.

Cross-functional review. Engineers read traces for technical failures. Product managers read them for user experience failures. Compliance officers read them for policy violations. Domain experts read them for factual errors. No single function can evaluate agent behavior comprehensively.

Closed-loop improvement. Observations feed back into the system. Not just dashboards that someone might look at. Annotations that trigger dataset updates that trigger experiments that trigger deployments. Chase’s loop, with governance authority attached.

Vendor independence. The framework should work whether you use LangSmith, Langfuse, Datadog, or a custom OpenTelemetry pipeline. Vendor lock-in in your governance infrastructure is an unacceptable risk. If your evaluation rubrics live inside a proprietary platform, you do not own your governance.

Cost transparency. Every trace costs money. Every LLM-as-judge evaluation costs money. The governance budget should be explicit, not hidden inside an engineering infrastructure line item where nobody questions it.
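The risk-based sampling property translates directly into code. A sketch, with invented risk tiers and rates standing in for an actual risk classification policy:

```python
import random

# Illustrative policy choices, not recommendations.
SAMPLE_RATES = {
    "financial_transaction": 1.00,  # every trace gets reviewed
    "medical_advice":        1.00,
    "legal_guidance":        1.00,
    "account_support":       0.10,  # statistical sample
    "general_chat":          0.02,
}

def should_review(risk_class: str, rng: random.Random) -> bool:
    """Decide whether a trace enters the review queue. Unknown classes
    default to full review: unclassified risk is high risk."""
    rate = SAMPLE_RATES.get(risk_class, 1.0)
    return rng.random() < rate

rng = random.Random(0)  # seeded for reproducibility
reviewed = sum(should_review("general_chat", rng) for _ in range(10_000))
print(f"general_chat traces sampled: {reviewed}/10000")
```

The table is the governance artifact. The function is trivial; who is allowed to edit `SAMPLE_RATES`, and under what approval process, is the part most organizations skip.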

The Uncomfortable Implication

If agent observability is governance, then most organizations are running ungoverned AI systems.

They have dashboards. They watch error rates. Some of them have even implemented evaluation pipelines. But they have not done the governance work: defining authority, establishing risk-based oversight, building cross-functional review, ensuring vendor independence.

The tools exist. The pattern exists (Chase documented it clearly, whatever he chose to call it). The convergence between observability and governance is already happening in the infrastructure layer.

The question is whether your organization will recognize it before an incident forces the recognition for you.


This analysis synthesizes Harrison Chase’s “You don’t know what your agent will do until it’s in production” (Feb 2026), academic research on LLM-as-a-Judge (Nov 2024), and Datadog’s AI Agent Monitoring launch (Jun 2025).

Victorino Group helps organizations build governance frameworks for AI operations. Let’s talk.
