Observability Is the Only Control Surface for Product Agents

Most of what we have written about agent observability lives inside the engineering loop. An agent runs in CI, a human watches the run, a reviewer reads the diff. The agent operates under supervision, and the trace is a convenience.

There is a harder case: the product agent that runs thousands of times a day with no human in the loop, answering customers while you sleep. For that agent, the trace is not a convenience. It is the only place you can see what happened, and the only place you can change what happens next.

IndyDevDan put it bluntly in a recent walkthrough of Pi coding agent observability: “If you don’t measure your agents, you’re not engineering. You’re gambling with tokens.” The line is aimed at builders, but it lands harder on operators. A product agent at scale is a fleet of small autonomous workers spending your money on every request. Without the trace, you do not run that fleet. You hope.

The control room you cannot see

A traditional API has a contract. You send a request, you get a response, you know roughly what it cost. A product agent has no such contract. Between request and response sit an unknown number of tool calls, an unknown token count, and a system prompt that was assembled at boot time from context you may never have read.

IndyDevDan demonstrates this directly. He sends the same instruction to two versions of his agent, one fed a spec in markdown, the other fed the same spec in HTML. Same task, same intent. On one run, the markdown agent burned more tokens than the HTML agent and logged roughly 170 trace events against 100. Same prompt, different spec format, roughly 70% more work. Nothing in the response would tell you that. Only the trace shows it.
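A minimal Python sketch of that comparison, assuming a hypothetical trace format in which each run is a list of event dicts with a `type` field (real tracing tools each have their own schema). The totals of 100 and 170 events come from the walkthrough; the split between tool calls and model turns is invented for illustration.

```python
from collections import Counter

# Hypothetical trace format: each event is a dict with a "type" key,
# such as "tool_call" or "model_turn". Adapt to your tracer's schema.
Event = dict


def summarize(trace: list[Event]) -> Counter:
    """Count events by type so two runs can be compared side by side."""
    return Counter(event["type"] for event in trace)


def extra_work(baseline: list[Event], candidate: list[Event]) -> float:
    """Extra events in `candidate`, as a fraction of `baseline`."""
    return (len(candidate) - len(baseline)) / len(baseline)


# Synthetic runs matching the reported totals (100 vs 170 events).
html_run = [{"type": "tool_call"}] * 40 + [{"type": "model_turn"}] * 60
markdown_run = [{"type": "tool_call"}] * 80 + [{"type": "model_turn"}] * 90

print(summarize(html_run))
print(summarize(markdown_run))
print(f"markdown run did {extra_work(html_run, markdown_run):.0%} more work")  # → 70%
```

Event counts are only a proxy. If your traces record token usage per event, sum that instead, since one event can cost far more than another.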

This is the part operators miss. The output looked fine in both cases. The customer got an answer. The difference lived entirely in the execution path, invisible to anyone reading the result. Multiply one silent doubling across thousands of daily runs and you have a cost line nobody can explain at month end.
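The month-end arithmetic is simple enough to sketch. Every input below is a hypothetical placeholder (tokens per run, daily volume, price per million tokens); only the 70% overhead echoes the single-run anecdote, and the point is the shape of the calculation, not the figure.

```python
def monthly_overhead_usd(
    baseline_tokens_per_run: int,
    overhead_fraction: float,
    runs_per_day: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Extra monthly spend caused by a fixed per-run overhead."""
    extra_tokens_per_run = baseline_tokens_per_run * overhead_fraction
    extra_tokens_per_month = extra_tokens_per_run * runs_per_day * days
    return extra_tokens_per_month / 1_000_000 * usd_per_million_tokens


# Hypothetical figures: 20k tokens per run, 5,000 runs a day, $3 per million tokens.
overhead = monthly_overhead_usd(20_000, 0.7, 5_000, 3.0)
print(f"${overhead:,.0f} per month")  # → $6,300 per month
```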

The hidden bill in the booted context

The second finding is worse, because it hides inside something most people treat as free. IndyDevDan notes that a single “say hello” through Claude Code cost him around twenty cents. Not twenty cents per session. Twenty cents to say hello, once.

The reason is the full booted context. When the agent starts, it loads system instructions, tool definitions, memory, project files, and conventions before it processes a single word from the user. That payload is billed on every turn. The cheapest possible message still drags the entire boot weight behind it. Logs from Tailscale’s Aperture gateway corroborate the pattern independently: the framing context, not the user’s words, dominates the bill.
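A rough Python model of why the frame dominates. It assumes, for simplicity, that every turn re-sends the full booted context plus the conversation so far and that nothing is cached; providers that support prompt caching bill cached input at a reduced rate, which lowers the cost but not the share of tokens. The 30,000-token boot size is a hypothetical placeholder.

```python
def boot_share(boot_tokens: int, turn_tokens: list[int]) -> float:
    """Fraction of billed input tokens that is the booted context.

    turn_tokens holds the tokens each turn adds to the conversation.
    Simplifying assumption: no prompt caching, and every turn re-sends
    boot context + full history.
    """
    billed_boot = 0
    billed_total = 0
    history = 0
    for added in turn_tokens:
        history += added  # this turn's new tokens join the history
        billed_boot += boot_tokens
        billed_total += boot_tokens + history
    return billed_boot / billed_total


# A one-turn "say hello" (about 5 tokens) against a hypothetical 30k-token boot.
print(f"{boot_share(30_000, [5]):.2%} of input is boot weight")  # → 99.98%
```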

For a product agent, this is the whole game. You are not paying for the clever answer. You are paying, over and over, for the context you booted it with. If you cannot see that context in the trace (the full system prompt, every injected file, every tool schema), you cannot cut the bill. You can only watch it grow.
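One way to start cutting is to rank what the booted context is made of. A minimal Python sketch, assuming you can export each component as text; the four-characters-per-token rule is a rough heuristic for English, not a tokenizer, so use your provider's token counting for real numbers. The component names and sizes are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def boot_breakdown(components: dict[str, str]) -> list[tuple[str, int, float]]:
    """Rank boot-context components by estimated tokens and share of total."""
    sizes = {name: estimate_tokens(text) for name, text in components.items()}
    total = sum(sizes.values())
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return [(name, tokens, tokens / total) for name, tokens in ranked]


# Placeholder strings standing in for real exported context.
booted = {
    "system_instructions": "S" * 4_000,
    "tool_schemas": "T" * 48_000,
    "memory": "M" * 8_000,
    "project_files": "F" * 20_000,
}
for name, tokens, share in boot_breakdown(booted):
    print(f"{name:<20} ~{tokens:>6} tokens  {share:.0%}")
```

Sorting by size puts the biggest line item first, which is usually where a review of "is this earning its tokens" pays off fastest.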

Why this sits upstream of unit economics

There is a value chain underneath every agent product. You spend tokens, you generate value from those tokens, you capture revenue from the value. IndyDevDan frames it as moving up the chain: use tokens, create value, capture the result. Most teams get stuck on the first step. They spend tokens and never learn whether value came out.

Observability is what moves you up that chain. The trace tells you cost per outcome, not cost in the abstract. It tells you which tool calls earned their keep and which ran for nothing. It tells you that the markdown spec took roughly 70% more work for the same result, so you switch formats and bank the difference across every future run.
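The gap between cost per request and cost per outcome fits in a few lines of Python. The `Run` record is hypothetical, a summary you would distill from each trace; treating tool spend on unresolved runs as a cutting candidate is one crude signal among many, not a verdict on the tool.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical per-run summary distilled from a trace."""
    cost_usd: float
    resolved: bool  # did the run deliver the outcome you care about?
    tool_costs: dict[str, float] = field(default_factory=dict)


def cost_per_request(runs: list[Run]) -> float:
    return sum(run.cost_usd for run in runs) / len(runs)


def cost_per_outcome(runs: list[Run]) -> float:
    """Total spend divided by resolved outcomes, not by requests."""
    resolved = sum(1 for run in runs if run.resolved)
    if resolved == 0:
        return float("inf")  # spending with nothing to show for it
    return sum(run.cost_usd for run in runs) / resolved


def tool_spend_on_unresolved(runs: list[Run]) -> dict[str, float]:
    """Per-tool spend on runs that did not resolve."""
    spend: dict[str, float] = defaultdict(float)
    for run in runs:
        if not run.resolved:
            for tool, cost in run.tool_costs.items():
                spend[tool] += cost
    return dict(spend)


runs = [
    Run(0.10, True, {"search": 0.02}),
    Run(0.12, True, {"search": 0.03}),
    Run(0.30, False, {"search": 0.05, "crawl": 0.20}),
]
print(f"per request: ${cost_per_request(runs):.3f}")
print(f"per outcome: ${cost_per_outcome(runs):.3f}")
print(tool_spend_on_unresolved(runs))
```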

Without that, a product agent is a cost center. Tokens go in, answers come out, and the relationship between the two is a mystery. With it, the same agent becomes something you can price, tune, and defend. The trace is the instrument that turns spending into economics.

That reframes the buyer. Engineering-agent observability is a developer tool: it helps the person watching the run. Product-agent observability is a business control: it helps the person who owns the margin. The dashboards may look similar. The stakes do not. One catches a bad diff. The other decides whether the product makes money at volume.

The methodology trap

A fair caution about the source. This is one creator, working from demos, with strong opinions. The spec-format finding is an anecdote from a single run, not a benchmark. Markdown will not always cost 70% more than HTML, and anyone who quotes that number as a law has missed the point.

The point is not the number. The point is that the number existed and was invisible until the trace surfaced it. The lesson generalizes even when the specific figure does not. Your product agent has its own silent doublings, its own twenty-cent hellos, its own booted weight nobody has read. You will not find them in the output. You will find them in the trace, or not at all.

Do this now

Pick your highest-volume product agent. Pull one full trace, start to finish: every tool call, the token count per turn, the cost, and the complete booted system prompt as the model actually received it. Read the system prompt out loud. Most teams have never seen it in full. Then ask one question of every line: is this earning its tokens at scale, or is it part of a hello that costs twenty cents?
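That one-trace review can be scripted once and rerun. A minimal Python sketch, assuming a hypothetical trace already exported into the `Trace` and `Turn` shapes below; the field names and the per-million-token prices are placeholders, not any vendor's schema or rates.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    input_tokens: int
    output_tokens: int
    tool_calls: list[str] = field(default_factory=list)


@dataclass
class Trace:
    system_prompt: str  # the booted prompt exactly as the model received it
    turns: list[Turn]


# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_USD_PER_M = 3.0
OUTPUT_USD_PER_M = 15.0


def turn_cost(turn: Turn) -> float:
    return (turn.input_tokens * INPUT_USD_PER_M
            + turn.output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def review(trace: Trace) -> float:
    """Print the full system prompt, then every turn; return total cost."""
    print("=== booted system prompt ===")
    print(trace.system_prompt)  # read this out loud
    total = 0.0
    for number, turn in enumerate(trace.turns, start=1):
        cost = turn_cost(turn)
        total += cost
        tools = ", ".join(turn.tool_calls) or "none"
        print(f"turn {number}: in={turn.input_tokens} out={turn.output_tokens} "
              f"tools=[{tools}] cost=${cost:.4f}")
    print(f"total: ${total:.4f}")
    return total


sample = Trace(
    system_prompt="<paste the full booted prompt here>",
    turns=[Turn(30_000, 200, ["lookup_order"]), Turn(30_400, 150)],
)
total = review(sample)
```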

Do that once and you stop gambling. You start operating.

This analysis synthesizes Pi Coding Agent Observability (IndyDevDan, June 2026).

Victorino Group helps teams instrument product agents so cost and quality stay visible at scale. Let’s talk.