A Tool's Output Is the Next Prompt: Designing Interfaces for Agent Readers

A tool runs. It prints something. A human glances at the output, decides what to do next, and moves on. That has been the implicit contract of the Unix toolbox for half a century. Structure was a convenience. Prose, color, emoji, summary lines at the bottom: all fine, because the reader on the other end was you.

That contract broke quietly the moment we started putting agents inside pipelines. The reader is no longer a person who can squint. It is a model whose next action will be conditioned almost entirely on the bytes you just emitted. The tool’s output is not a report. It is the prompt for the next turn.

We learned this the hard way rebuilding a Portuguese accent auditor we wrote a year ago. The fix took longer than we expected, not because the linguistics were hard, but because we had to change the unit of correctness. The old unit was “did the word get fixed.” The new unit is “did the agent get the right decision.”

What Goes Wrong When the Reader Changes

The original auditor was a dictionary-driven substitution map. It scanned Portuguese text, found unaccented forms of common accented words, and rewrote them in place. For a human author, this is a productivity win. You typed voce and meant você, the tool fixed it, you moved on. The few false positives, like the imperative form of a verb that happened to collide with a noun, were caught by your eyes on the next read.

Three production incidents over six weeks told us that loop was no longer the loop we were running. The author was no longer a human. It was a content pipeline. A drafting agent wrote text, the auditor silently rewrote pieces of it, and the next agent in the chain received the modified version as its prompt for the next stage. The bad rewrite became the input to the next decision. Nobody re-read the diff because there was no human in the seat where the diff would have been re-read.

The specific failures clustered around three words. Esta, the demonstrative pronoun, was getting auto-corrected to Está, the verb. Continua, the third-person verb, was being upgraded to a Latin loanword form that does not exist in modern Portuguese. Secretaria, a noun, kept becoming Secretária, a different noun. Each fix was wrong for that context, but the dictionary did not know context. It only knew word shapes.

You can patch a dictionary. You cannot patch the assumption that the reader will catch your mistakes.

The New Unit of Correctness

The rebuild changed exactly one thing: what the tool says about itself. The dictionary stayed. The orthography rules stayed. The new tool produces a three-way classification for every candidate change.

FIX. Unambiguous correction. A word that has exactly one canonical form, no context-dependent alternatives, no plausible legitimate use of the unaccented version. The tool applies these silently and logs them. voce to você. nao to não. Simple Hunspell territory.

REVIEW. Ambiguous. The candidate change has more than one plausible target form, or the context-free heuristic cannot decide. The tool does not fix these. It surfaces them with the original word, the candidate corrections, and enough surrounding context for the agent reading the report to make the call. Esta in mid-sentence with no leading verb is a likely demonstrative. Esta after a comma in a clause that already has a subject is probably the verb. Surface both options, hand the decision back.

SKIP. Not our problem. English loanwords, proper nouns, brand names, technical terms that do not have Portuguese accented equivalents. The tool ignores these on purpose and says so, so the next agent does not waste a turn asking why medium or dashboard was not flagged.

The taxonomy is not novel as linguistics. It is novel as interface design. The three categories map directly to the three things an agent can do with a verdict: act on it, reason about it, or move past it. A two-category tool (“fixed” versus “not fixed”) forces the next agent to parse prose to figure out which kind of “not fixed” it is looking at. A three-category tool answers that question in the schema itself.
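A minimal Python sketch of what "answering in the schema itself" could look like. Everything here is illustrative rather than the auditor's actual code: the names Verdict, Finding, classify, and build_report are hypothetical, and the three tiny word sets stand in for a real dictionary. REVIEW is decided here by set membership alone; the context-aware heuristic described above is omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Verdict(Enum):
    FIX = "FIX"        # applied silently and logged
    REVIEW = "REVIEW"  # surfaced for the reading agent to decide
    SKIP = "SKIP"      # ignored on purpose, and reported as such


@dataclass
class Finding:
    verdict: Verdict
    original: str
    candidates: List[str]
    context: str

    def to_dict(self) -> Dict[str, object]:
        # Plain strings and lists so the report serializes cleanly to JSON.
        return {
            "verdict": self.verdict.value,
            "original": self.original,
            "candidates": self.candidates,
            "context": self.context,
        }


# Hypothetical toy data; a real auditor would load these from its dictionary.
UNAMBIGUOUS = {"voce": "você", "nao": "não"}
AMBIGUOUS = {"esta": ["esta", "está"], "secretaria": ["secretaria", "secretária"]}
NOT_OURS = {"dashboard", "medium"}


def classify(word: str, context: str) -> Optional[Finding]:
    """Return a Finding for a candidate word, or None if there is nothing to report."""
    key = word.lower()
    if key in NOT_OURS:
        return Finding(Verdict.SKIP, word, [], context)
    if key in AMBIGUOUS:
        return Finding(Verdict.REVIEW, word, AMBIGUOUS[key], context)
    if key in UNAMBIGUOUS:
        return Finding(Verdict.FIX, word, [UNAMBIGUOUS[key]], context)
    return None


def build_report(text: str) -> List[Dict[str, object]]:
    """Classify every whitespace-separated token and collect the findings."""
    report = []
    for raw in text.split():
        finding = classify(raw.strip(".,;:!?"), text)
        if finding is not None:
            report.append(finding.to_dict())
    return report
```

Calling build_report on a sentence and passing the result to json.dumps gives the next agent a list it can filter by the verdict field, with no prose to parse.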

Exit Codes Carry the Meta-Decision

The other piece of the rebuild was the exit contract. The old tool exited zero on success and non-zero on parse failure. That left no room for “ran fine, but you need to look at this.”

The new contract uses three codes.

0 means proceed. Everything the tool wanted to fix has been fixed, and there are no REVIEW entries. The pipeline continues.

10 means read the report and decide. The tool ran, applied its FIX entries, and produced one or more REVIEW entries that the next agent must act on before the pipeline can continue. The orchestrator stops on 10, surfaces the report, and waits for a decision.

2 is reserved for actual error conditions: broken input, missing config, dictionary not found.

Three codes, one meta-decision per code. The orchestrator does not have to parse the tool’s stdout to know what to do next. The exit number is the routing signal. Prose remains for humans who eventually read the log. The machine decision lives in the integer.
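The contract is small enough to sketch in a few lines of Python. This is an illustration under assumptions, not the tool's real entry point: exit_code_for and run are hypothetical names, verdicts are represented as plain strings, and the set of exceptions treated as "actual error conditions" is a guess.

```python
import sys
from typing import Callable, Iterable

EXIT_PROCEED = 0   # nothing left for anyone to decide
EXIT_REVIEW = 10   # at least one REVIEW entry needs a decision
EXIT_ERROR = 2     # broken input, missing config, dictionary not found


def exit_code_for(verdicts: Iterable[str]) -> int:
    """Map a run's verdicts to the routing integer."""
    return EXIT_REVIEW if any(v == "REVIEW" for v in verdicts) else EXIT_PROCEED


def run(audit: Callable[[], Iterable[str]]) -> int:
    """Run a hypothetical audit callable and translate the outcome to an exit code."""
    try:
        verdicts = list(audit())
    except (OSError, ValueError) as exc:
        # Prose goes to stderr for the humans who read the log later.
        print(f"audit failed: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return exit_code_for(verdicts)
```

A command-line wrapper would end with sys.exit(run(...)), so the integer is the last thing the tool says.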

This is the same pattern that makes grep composable. grep exits zero when it found a match, one when it did not, two on error. Pipelines branch on the integer. The text is for you. The integer is for the shell.
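On the orchestrator side, branching on the integer might look like the following Python sketch. The function name route and the labels it returns are hypothetical, and treating every code other than 0 and 10 as an abort is an assumption; the contract itself only names 2 for errors.

```python
import subprocess
from typing import List


def route(command: List[str]) -> str:
    """Run a tool and pick the next pipeline step from its exit code alone."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return "proceed"   # continue the pipeline
    if result.returncode == 10:
        return "review"    # stop, surface the report, wait for a decision
    return "abort"         # 2 or anything unexpected: treat as an error
```

The captured stdout is still available in result.stdout for the report, but nothing in route reads it to decide where to go.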

What We Found When We Ran the New Tool Against the Archive

We audited 425 existing Portuguese articles with the rebuilt auditor. The first pass flagged 266 files. That number scared us until we read the reports. Most of the flags were REVIEW entries on TitleCase mid-sentence words, which the heuristic was conservatively surfacing rather than auto-fixing.

We tightened three guards. A stopwords list for unambiguous English terms that we had been treating as Portuguese candidates. An ambiguous-words list for loanwords like medium and dashboard that have both English and Portuguese readings. A TitleCase heuristic that surfaces mid-sentence capitalized words as REVIEW rather than auto-FIX, on the theory that capitalization mid-sentence is almost always a signal of intentional emphasis, brand, or proper noun.
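The order in which the guards run can be sketched in Python as follows. The word lists are hypothetical toy data, and guard is an illustrative name. The text does not say which verdict the ambiguous-words list produces; this sketch assumes SKIP, in line with the earlier statement that medium and dashboard are not flagged. Sentence starts are detected crudely, by checking the previous token for closing punctuation.

```python
from typing import List, Optional, Tuple

# Hypothetical toy lists standing in for the real ones.
STOPWORDS = {"pipeline", "the"}               # unambiguous English terms
AMBIGUOUS_LOANWORDS = {"medium", "dashboard"}  # English and Portuguese readings
FIXES = {"voce": "você", "nao": "não", "agua": "água"}

PUNCT = ".,;:!?"


def guard(tokens: List[str], index: int) -> Optional[Tuple[str, str]]:
    """Return (verdict, detail) for tokens[index], or None if it is not a candidate."""
    word = tokens[index].strip(PUNCT)
    key = word.lower()
    if key in STOPWORDS:
        return ("SKIP", "stopword")
    if key in AMBIGUOUS_LOANWORDS:
        return ("SKIP", "ambiguous loanword")  # assumed verdict, see lead-in
    if key not in FIXES:
        return None
    starts_sentence = index == 0 or tokens[index - 1].endswith((".", "!", "?"))
    if word[:1].isupper() and not starts_sentence:
        # Capitalized mid-sentence: likely emphasis, brand, or proper noun.
        return ("REVIEW", "TitleCase mid-sentence")
    return ("FIX", FIXES[key])
```

Putting the TitleCase check before the final FIX is the design choice that matters: a word only reaches the silent-fix branch after every reason to doubt it has been ruled out.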

After the three guards landed, the false positive rate on a sample of 22 files with confirmed real typos went to zero. Not low. Zero. The tool now fixes only what it is certain about and surfaces everything else for a decision.

That number matters less than the shape of the work that produced it. We did not write more rules. We wrote fewer rules with more honesty about which rules belonged in the FIX bucket and which belonged in REVIEW.

Why This Generalizes Beyond Portuguese

The accent auditor is a small example of a category. Any tool whose primary consumer is an agent inside a pipeline faces the same three pressures.

Silent auto-fixes compound. If your tool changes something and does not surface the change in a structure the next stage can read, the next stage works from the wrong input. A spellchecker that rewrites valid words inside a draft is the same shape of failure as a linter that auto-formats away a meaningful whitespace difference, or a content sanitizer that strips an attribution because it looked like a placeholder.

Ambiguity needs a home. Tools that collapse “maybe” into either “yes” or “no” force the consumer to second-guess every output. An agent that does not trust a tool stops using it. A REVIEW category is not a confession of weakness. It is the only honest answer for the cases where the tool cannot decide on its own.

Exit codes are cheap, prose is expensive. An integer in $? costs nothing to read. A paragraph of explanation costs tokens, time, and a non-trivial chance of misparsing. Move every decision the orchestrator must make into the exit code. Reserve prose for the humans who will eventually audit what the machines did.

Do This Now

Pick one tool in your pipeline whose primary consumer is now an agent. Open the source. Find every place the tool produces output that a downstream stage will read.

Ask three questions in order.

What does this tool silently change that a human would have caught on re-read? Make a list. Each entry is a candidate for promotion from FIX to REVIEW.

What does this tool collapse from ambiguous into a binary answer? Each collapse is a candidate for the REVIEW category. The cost of adding the category is small. The cost of the wrong binary answer flowing into the next stage is large.

What decision is your orchestrator currently making by parsing the tool’s stdout? Move that decision into the exit code. A two-line change in the tool removes a brittle string match from every pipeline that calls it.

The teams that win the next phase of agent operations will not be the ones with the most tools. They will be the ones whose tools speak honestly to the agents reading them.

This analysis synthesizes LibreOffice Dictionaries pt_BR (LibreOffice / Raimundo Moura, maintained since 2006), spylls (zverok, 2020-2024), and Five Levels of Bash Containment (Victorino Group, May 2026).

Victorino Group helps engineering leaders design tools that communicate cleanly with the agents driving them.