When Watching Every Agent Trace Gets Cheap, "We Couldn't See It" Stops Being a Defense

Thiago Victorino

We have argued before that the trace is the only place you control a product agent. That argument had a hole. Everyone nodded, then said the same thing: we cannot afford to look at all of them.

They were right, until last week. You could never run a coding agent or an LLM judge over 100% of production traces. At thousands of runs a day, the inference bill alone made full coverage a fantasy. So teams sampled. They looked at 1%, or at the runs that errored, and called it observability. The other 99% ran in the dark, and “we couldn’t see it” was a true and acceptable answer.

Braintrust just published the architecture that ends that excuse. Their Topics pipeline classifies a brand-new trace in roughly 100 milliseconds, with no LLM call at classification time. Full coverage stopped being expensive. Once it is cheap, the defense that you could not see what your agents did stops holding up.

The trick: do the expensive thinking once

The reason full coverage was costly is that everyone put the model in the hot path. Every new trace meant another inference call to read it, judge it, label it. Scale the traffic, scale the bill. Braintrust broke that coupling with a six-stage pipeline that front-loads the cost.

The pipeline runs preprocess, facet, embed, cluster, name, classify. Preprocess caps each trace at 128K tokens. Facet uses a Gemma model on Baseten to pull structured observations out of the trace, around 10 seconds of work. Embed turns each facet into a 1024-dimensional vector. Cluster groups them with HDBSCAN over a UMAP reduction, handling up to 50,000 facets in about 30 seconds. Name labels each cluster. Only then comes classify.

And classify is the move. A new trace gets faceted and embedded, then matched against the existing clusters by vector distance. No model reads it to decide where it belongs. The expensive semantic work happened once, when the map was built. Every trace after that is a cheap geometric lookup, roughly 100 milliseconds, no LLM call. You pay the heavy cost to draw the map, then pay almost nothing to place each new trace on it.
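To make "a cheap geometric lookup" concrete, here is a minimal sketch of matching a new embedding against existing clusters. It is an illustration, not Braintrust's implementation: the nearest-centroid rule, the cosine distance, the `max_distance` cutoff, the cluster IDs, and the toy 3-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def classify(embedding, centroids, max_distance=0.3):
    """Assign an embedding to the nearest cluster centroid.

    Returns (cluster_id, distance). cluster_id is None when even the
    nearest centroid is farther away than max_distance (assumed cutoff).
    """
    best_id, best_dist = None, float("inf")
    for cluster_id, centroid in centroids.items():
        dist = cosine_distance(embedding, centroid)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    if best_dist > max_distance:
        return None, best_dist
    return best_id, best_dist


# Hypothetical centroids from an earlier clustering run. Real vectors in the
# article are 1024-dimensional; three dimensions keep the example readable.
CENTROIDS = {
    17: [1.0, 0.0, 0.0],
    42: [0.0, 1.0, 0.0],
}

cluster_id, distance = classify([0.9, 0.1, 0.0], CENTROIDS)
unmatched_id, _ = classify([0.0, 0.0, 1.0], CENTROIDS)
```

The first vector lands in cluster 17. The second is far from both centroids and comes back unassigned, which is the case you would want to count and watch as a signal that the map needs rebuilding.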

That is the whole economic shift. Active observability over 100% of traffic is now a vector lookup, not an inference fleet.

Cheapness is not a feature. It is an obligation.

When something governance-relevant gets cheap, it quietly changes from “nice to have” into “no excuse not to.”

Think about what “we sampled 1%” meant as a legal and operational posture. It meant that when an agent went wrong at scale (leaked something, gave bad advice, burned money on a silent loop), you could say the run fell outside the sample. The cost of looking everywhere was your alibi. Auditors accepted it because the alternative was genuinely unaffordable.

Strip out the cost and the alibi goes with it. If classifying every trace costs 100 milliseconds and no model call, then “we did not see it” no longer means “it was too expensive to see.” It means “we chose not to look.” Those are very different sentences in an incident review, and only one of them is survivable. The Braintrust number does not just make a product faster. It moves a whole class of failures from forgivable to negligent.

This is the pattern worth internalizing. Every time the cost of seeing collapses, the standard of care rises to match. Cheap full coverage is the new floor, not the new ceiling.

Classifications become columns you can join on

The deeper consequence is structural. Once every trace carries a cluster assignment computed in real time, that assignment stops being a chart in a dashboard. It becomes a column.

A cluster ID attached to every trace is data you can query. You can alert on it: page me when traffic into the “refund dispute escalated” cluster jumps 3x in an hour. You can join on it: cross the cluster against your billing table and find that one behavior pattern accounts for 40% of token spend. You can gate on it: route any trace landing in a known-bad cluster to human review before the response ships. None of that needs a human reading traces. It needs a column and a WHERE clause.
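Here is what "a column and a WHERE clause" can look like in practice. This is a sketch using an in-memory SQLite database; the `traces` and `billing` tables, their columns, the cluster IDs, and the token counts are all hypothetical, chosen only to show the join and the gate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traces  (trace_id TEXT PRIMARY KEY, cluster_id INTEGER);
    CREATE TABLE billing (trace_id TEXT PRIMARY KEY, tokens INTEGER);
""")
conn.executemany(
    "INSERT INTO traces VALUES (?, ?)",
    [("t1", 17), ("t2", 17), ("t3", 42), ("t4", 42), ("t5", 99)],
)
conn.executemany(
    "INSERT INTO billing VALUES (?, ?)",
    [("t1", 100), ("t2", 300), ("t3", 150), ("t4", 200), ("t5", 250)],
)

# Join: total token spend per cluster, largest first.
spend_by_cluster = conn.execute("""
    SELECT t.cluster_id, SUM(b.tokens) AS total_tokens
    FROM traces AS t
    JOIN billing AS b ON b.trace_id = t.trace_id
    GROUP BY t.cluster_id
    ORDER BY total_tokens DESC
""").fetchall()

# Share of overall spend attributable to each cluster.
grand_total = sum(tokens for _, tokens in spend_by_cluster)
spend_share = {cid: tokens / grand_total for cid, tokens in spend_by_cluster}

# Gate: select traces in a known-bad cluster for human review.
KNOWN_BAD_CLUSTER = 99  # hypothetical cluster ID
needs_review = [
    row[0]
    for row in conn.execute(
        "SELECT trace_id FROM traces WHERE cluster_id = ? ORDER BY trace_id",
        (KNOWN_BAD_CLUSTER,),
    )
]
```

With this toy data, cluster 17 accounts for 400 of 1,000 tokens, and trace `t5` is the only one held for review. Nobody read a trace to get either answer.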

This is the floor we have been building toward. The trace is the control surface. Continuous classification is what turns that surface into something a machine can act on without a person in the loop. You do not read 10,000 traces. You alert on the cluster that should not be growing, and you join the cluster that costs too much against the table that proves it.

One honest caveat: trust the cluster, not the name

Braintrust is direct about a limitation, and it matters for anyone building on top of this. The names drift. Run the naming stage twice and the cluster that was “billing question” becomes “payment inquiry.” So they treat the cluster, not the name, as the stable identity.

That is a real design constraint, not a footnote. If you build alerts or routing rules on the human-readable label, they break the next time the map regenerates. Bind your governance logic to the cluster identity underneath, the geometric region in embedding space, and let the name be a convenience for humans reading the board. The approach traces back to Anthropic’s Clio, which mapped real-world AI use while preserving privacy. Braintrust adapted it for agent traces, which are far less uniform than chat logs, and inherited the same lesson: the structure is stable, the words you hang on it are not.
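A small sketch shows why the binding matters. It assumes each cluster carries a stable integer ID that survives a rename; the IDs, the rule actions, and every name other than the “billing question” and “payment inquiry” pair are hypothetical.

```python
# Governance rules keyed on the stable cluster ID (assumed to be an integer).
RULES_BY_ID = {17: "page_oncall", 99: "human_review"}

# A fragile alternative: the same rule keyed on the display name.
RULES_BY_NAME = {"billing question": "page_oncall"}

# Display names from two separate runs of the naming stage.
NAMES_RUN_1 = {17: "billing question", 99: "prompt injection attempt"}
NAMES_RUN_2 = {17: "payment inquiry", 99: "injection probe"}


def action_for(cluster_id):
    """Look up the action by cluster ID; the display name plays no part."""
    return RULES_BY_ID.get(cluster_id, "none")


def describe(cluster_id, names):
    """Render a human-readable line; the name is presentation only."""
    label = names.get(cluster_id, "unnamed")
    return f"{label} (cluster {cluster_id}): {action_for(cluster_id)}"


before = describe(17, NAMES_RUN_1)
after = describe(17, NAMES_RUN_2)

# The name-keyed rule silently stops matching after the rename.
fragile_before = RULES_BY_NAME.get(NAMES_RUN_1[17])
fragile_after = RULES_BY_NAME.get(NAMES_RUN_2[17])
```

The ID-keyed rule fires identically under both names. The name-keyed rule returns nothing after the rename, and it does so without raising an error, which is the failure you least want in an alerting path.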

Do this now

Find out what fraction of your production agent traces you classify today. For most teams the honest answer is the runs that errored plus a thin sample, and the rest runs blind. Then ask the question that just changed: if active classification over 100% of traffic costs roughly 100 milliseconds per trace and no model call, what is your actual reason for not running it? Write that reason down. If it is still “too expensive,” check it against the Braintrust architecture, because that excuse expired last week. If the real reason is “we never built it,” that is a roadmap item, not a defense.

Pick one cluster that should never grow. Wire an alert to it. That is full-coverage governance, started in an afternoon.
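The afternoon version of that alert can be very small. The sketch below compares per-cluster counts between two time windows and flags watched clusters that grew by a chosen factor. The function name, the 3x factor, the minimum count, and the sample data are assumptions for illustration; a real deployment would read the windows from your trace store.

```python
from collections import Counter


def growth_alerts(previous, current, watched, factor=3.0, min_count=5):
    """Return watched cluster IDs whose traffic grew by at least `factor`.

    `previous` and `current` are iterables of cluster IDs, one per trace,
    for two consecutive windows (for example, the last two hours).
    `min_count` suppresses alerts on a handful of traces.
    """
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    alerts = []
    for cluster_id in sorted(watched):
        before = prev_counts[cluster_id]
        now = curr_counts[cluster_id]
        # max(before, 1) avoids a zero baseline making any traffic an alert.
        if now >= min_count and now >= factor * max(before, 1):
            alerts.append(cluster_id)
    return alerts


# Hypothetical windows: cluster 99 should never grow, cluster 17 is routine.
last_hour = [99] * 2 + [17] * 50
this_hour = [99] * 9 + [17] * 55

firing = growth_alerts(last_hour, this_hour, watched={17, 99})
```

Cluster 99 goes from 2 traces to 9 and fires. Cluster 17 grows from 50 to 55 and stays quiet.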


This analysis synthesizes How We Made Continuous Trace Intelligence Possible at Scale (Braintrust, June 2026) and Clio (Anthropic, December 2024).


All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
