Your Agent Already Knows What's Wrong — You're Just Not Listening
Factory published a detailed technical article in January. Not about a new model. Not about benchmark scores. About what happens after you deploy an agent — specifically, what happens when you actually watch it work.
Their system, called Signals, uses an LLM to analyze thousands of daily agent sessions. It identifies friction patterns — where users struggle, where agents fail, where conversations go sideways — without exposing the raw content of any individual session. The numbers from a sample report: 1,946 sessions across 39 batches. 58% contained friction moments. 83% contained delight moments. That those two statistics coexist tells you everything about the current state of agentic AI.
Most organizations would stop at “58% friction.” Signals does not stop there.
The Seven Signals of Agent Friction
Factory identified seven distinct friction patterns in agent sessions, each with different severity distributions:
Error events (35% high severity) — the obvious ones. The agent throws an error, the user notices.
Repeated rephrasing (42% high severity) — the user asks the same thing three or more times in different ways. This means the agent heard but did not understand.
Escalation tone (28% high severity) — the user’s language shifts from collaborative to directive. Frustration leaking through word choice.
Platform confusion (15% high severity) — the user misunderstands what the agent can do. A product design problem, not an AI problem.
Abandoned tool flows (48% high severity) — the user starts using a capability and gives up midway. The highest severity rate in the taxonomy, and the most underdiagnosed.
Backtracking (22% high severity) — the user manually undoes what the agent did. Silent rejection.
Context churn (38% high severity) — users repeatedly adding and removing items from the context window. This one is special.
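The taxonomy above is small enough to encode directly. As a starting point, here is a minimal sketch: the dictionary keys and severity rates come from the article, while the structure and the `rank_by_severity` helper are illustrative, not Factory's actual schema.

```python
# The seven friction signals and the high-severity rates reported
# in the article. The dict layout itself is a hypothetical encoding.
FRICTION_SIGNALS = {
    "error_event": 0.35,
    "repeated_rephrasing": 0.42,
    "escalation_tone": 0.28,
    "platform_confusion": 0.15,
    "abandoned_tool_flow": 0.48,
    "backtracking": 0.22,
    "context_churn": 0.38,
}

def rank_by_severity(signals: dict[str, float]) -> list[str]:
    """Order signal types by how often they show up as high severity."""
    return sorted(signals, key=signals.get, reverse=True)
```

Ranking by high-severity rate puts abandoned tool flows first, which matches the article's observation that they are the most severe and most underdiagnosed pattern.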
Context Churn: The Leading Indicator Nobody Watches
Of all seven friction types, context churn is the most interesting because it precedes the others. Users fidgeting with context — adding a file, removing it, adding a different one, removing that too — is a behavioral signal that the user has lost confidence in the agent’s ability to understand the task. They are trying to make the agent understand by restructuring its inputs, and failing.
Factory found that context churn often appears before any of the other six signals. It is the leading indicator, not the lagging one. By the time the user starts rephrasing or escalating, the friction has been building for several turns.
If your monitoring only catches errors and timeouts, you are seeing the symptoms. Context churn is the diagnosis.
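Churn detection does not require an LLM. If your agent logs context add/remove events, a simple flip counter over a recent window can flag the fidgeting behavior described above. This is a sketch under assumed event shapes — `("add", item)` / `("remove", item)` tuples are hypothetical, not Factory's logging format.

```python
def context_churn_score(events, window=10):
    """Count add/remove flips on the same context item within the
    most recent `window` context events. A high flip count suggests
    the user is restructuring inputs because the agent isn't
    understanding the task.

    events: ordered list of (action, item) tuples,
            e.g. ("add", "main.py"), ("remove", "main.py").
    """
    recent = events[-window:]
    flips = 0
    last_action = {}
    for action, item in recent:
        # A flip is any item whose action reverses its previous one.
        if item in last_action and last_action[item] != action:
            flips += 1
        last_action[item] = action
    return flips
```

A monitoring job could alert when the score crosses a threshold, giving you the leading indicator several turns before rephrasing or escalation appears.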
Error Recovery Beats Error Prevention
Here is the counterintuitive finding: sessions where agents recovered gracefully from errors scored higher on delight than sessions that were flawless from start to finish.
This inverts the standard engineering priority. Most teams optimize for preventing failures. Factory’s data suggests they should optimize for recovering from them. A user who watches an agent stumble, acknowledge the problem, and fix itself comes away more impressed than a user whose session happened to go smoothly.
The implication for operations is significant. Agent reliability is not about eliminating errors. It is about ensuring the system detects errors, communicates them transparently, and resolves them quickly. Resilience over perfection.
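One way to act on this finding is to stop scoring sessions on a flawless/failed axis and instead distinguish recovered from unrecovered errors. The event names below (`"error"`, `"acknowledge"`, `"fix"`) are hypothetical labels for illustration:

```python
def classify_session(events):
    """Classify a session's error arc from an ordered list of event
    labels. Sessions that hit an error and then acknowledge and fix
    it are the ones the article found scored highest on delight."""
    if "error" not in events:
        return "flawless"
    tail = events[events.index("error") + 1:]
    if "acknowledge" in tail and "fix" in tail:
        return "recovered"
    return "unrecovered"
```

Tracking the recovered/unrecovered ratio, rather than the raw error rate, aligns the metric with resilience over perfection.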
The Self-Healing Loop
The most operationally interesting part of Signals is not the detection — it is what happens after detection.
When friction patterns exceed a threshold, the system auto-files a ticket in Linear. A Droid (Factory’s autonomous agent) self-assigns the ticket, implements a fix, and conducts its own code review. A human then approves or rejects the merge.
That human approval gate matters. This is not a fully autonomous system. It is a semi-autonomous system with a human checkpoint at the most consequential moment — the moment code enters production. The pattern is: automated detection, automated diagnosis, automated implementation, human judgment on deployment.
The results: 73% of auto-filed issues resolved without manual intervention beyond that approval gate. Average fix deployment time under four hours. A 30% reduction in repeated rephrasing friction after the system implemented improved ambiguity handling.
This is a closed loop operating at production scale. Not a research prototype. Not a demo. A system processing nearly two thousand sessions daily and improving itself based on what it observes.
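The control flow of that loop is worth making explicit, because the human checkpoint sits in exactly one place. Here is a sketch with placeholder callables standing in for the real integrations (a Linear ticket API, an autonomous coding agent, a review UI) — none of these function names are Factory's:

```python
def self_healing_cycle(friction_rate, threshold,
                       file_ticket, implement_fix,
                       human_approves, deploy):
    """One pass of the detect -> fix -> approve -> deploy loop.
    Everything is automated except the single human gate before
    code enters production."""
    if friction_rate < threshold:
        return "no_action"           # automated detection
    ticket = file_ticket(friction_rate)   # automated diagnosis
    patch = implement_fix(ticket)         # automated implementation
    if not human_approves(patch):         # human judgment on deployment
        return "rejected"
    deploy(patch)
    return "deployed"
```

The design choice to gate only the merge, not the diagnosis or implementation, is what lets 73% of issues resolve with no manual work beyond a single approval click.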
The Observability Gap
The Eno Reyes interview with Stack Overflow in February 2026 adds an important dimension. Factory’s CEO argues that a team’s code-quality baseline — not code volume, not AI adoption rate — is the strongest predictor of whether AI agents will accelerate or decelerate an engineering organization.
Factory identifies hundreds of validation signals in software development: compilation, linting, test passage, documentation quality, complexity scores, security scans. Most organizations implement very few of these comprehensively. The thesis is that autonomy comes not from better models but from bringing in more of those signals automatically.
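In practice, "bringing in more signals" starts with running the checks you already have and measuring coverage. A minimal sketch, where each check is a zero-argument callable standing in for a real linter, test runner, or scanner (the names are illustrative):

```python
def run_validation_signals(checks):
    """Run named validation checks and collect pass/fail results.

    checks: dict mapping a signal name (e.g. "lint", "tests",
            "security_scan") to a callable returning bool.
    Returns the per-signal results and the fraction passing.
    """
    results = {name: bool(check()) for name, check in checks.items()}
    coverage = sum(results.values()) / len(results)
    return results, coverage
```

The point of the thesis is the breadth of the `checks` dict: an organization running three signals comprehensively sees far less than one running thirty automatically.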
This maps directly onto the Signals system. The agent improves not because someone fine-tuned a model. It improves because the system observes its own performance, quantifies the friction, and applies targeted fixes. The signals are the path to autonomy.
What This Means for Operations
Most conversations about AI agents focus on capability: what can the agent do? Factory’s contribution is reframing the question: what can you see about what the agent does?
An agent operating without observability is like a manufacturing line without quality control. It might produce good output for a while. You will not know when it stops, and you will not know why.
Factory’s seven-signal taxonomy is not the only possible framework. But it illustrates the minimum viable observation: you need to see errors (the obvious), behavioral patterns (the subtle), and leading indicators (the predictive). Most organizations today can barely see errors.
The self-healing loop is the aspiration. Before you get there, you need to answer simpler questions: How many of your agent sessions have friction? What kind? Which patterns repeat? What is the fix cycle time when you do catch something?
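Those questions can be answered from very simple session records, long before any self-healing machinery exists. A sketch, assuming each session record carries a list of friction labels and an optional fix cycle time (both field names are hypothetical):

```python
from collections import Counter

def friction_summary(sessions):
    """Answer the baseline questions: how many sessions have
    friction, which kinds repeat, and the average fix cycle time.

    sessions: list of dicts with a "friction" list of signal names
              and a "fix_hours" float when a fix shipped, else None.
    """
    with_friction = [s for s in sessions if s["friction"]]
    kinds = Counter(k for s in with_friction for k in s["friction"])
    fixed = [s["fix_hours"] for s in sessions
             if s.get("fix_hours") is not None]
    return {
        "friction_rate": len(with_friction) / len(sessions),
        "top_signals": kinds.most_common(3),
        "avg_fix_hours": sum(fixed) / len(fixed) if fixed else None,
    }
```

Even a daily report this crude puts you ahead of an organization that only counts errors.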
If you cannot answer those questions, your agent is already telling you what is wrong. You are just not listening.
The Privacy Question
One objection to comprehensive agent monitoring is privacy. If you analyze every session, you see everything users do.
Factory addresses this through layered abstraction. The LLM extracts behavioral patterns while omitting specific content. Individual session results aggregate into statistics that are meaningful only at scale. Patterns surface only across enough distinct sessions to prevent identifying individual users.
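The last of those layers — surfacing a pattern only when enough distinct sessions exhibit it — is essentially a k-anonymity-style threshold, and it is simple to sketch. The function below is illustrative, not Factory's implementation:

```python
from collections import Counter

def surface_patterns(session_patterns, k=5):
    """Aggregate per-session behavioral patterns and surface only
    those observed in at least k distinct sessions, so no pattern
    can be traced back to an individual user's session.

    session_patterns: one set of pattern labels per session.
    """
    counts = Counter(p for patterns in session_patterns
                     for p in set(patterns))
    return {pattern: n for pattern, n in counts.items() if n >= k}
```

Patterns below the threshold simply never leave the aggregation layer, which is how friction statistics stay meaningful at scale without exposing any single conversation.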
The approach works, but the implementation details matter. Factory uses BigQuery and OpenAI’s Batch API for processing, which means session data — however abstracted — flows through external infrastructure. Organizations with strict data residency requirements will need to adapt this architecture.
The point is not that Factory’s specific implementation is universal. The point is that the privacy objection is solvable. The observability gap is a choice, not a technical constraint.
From Reactive to Predictive
Factory’s roadmap points toward real-time friction indicators during active sessions — not just daily batch analysis. Beyond reactive fixes, Signals identifies missing capabilities when clusters of sessions reveal repeated requests for features that do not exist. One recent system proposal: tracking “specification drift,” where users gradually shift their goals mid-conversation.
This trajectory — from batch observation to real-time monitoring to predictive intervention — mirrors the evolution of traditional infrastructure observability. We went from log files to dashboards to anomaly detection to auto-remediation. Agent observability is following the same path, compressed into months instead of decades.
The organizations that build this infrastructure now will have a compounding advantage. Not because their agents are better — the models are increasingly commoditized. Because they can see what their agents do, measure how well they do it, and improve the system systematically.
That is what operating AI actually looks like.
If this resonates, let's talk
We help companies implement AI without losing control.
Schedule a Conversation