Reliability in Regulated AI Comes From the Harness, Not the Model

Thiago Victorino

Bayer runs an agentic AI platform in production over decades of preclinical safety data. It is called PRINCE, built with Thoughtworks, live since early 2024. The Thoughtworks and Bayer case study published in June 2026 documents how it works. The most useful detail in the whole account is a constraint, not a model: the agent that writes SQL can only emit SELECT. DELETE, INSERT, and UPDATE are blocked at the harness layer, before the database ever sees them.

That single design choice carries the thesis. In a regulated environment, you do not earn reliability by asking a smarter model to behave. You earn it by building the surrounding system so the dangerous action is structurally unavailable.

The Model Is the Smallest Part

The PRINCE architecture spends most of its engineering budget on everything around the model. Two disciplines do the heavy lifting.

The first is context engineering: routing the right information to specialized agents instead of dumping everything into one prompt. A query fans out into five parallel expansions, each phrasing the user’s intent differently. The retrieval layer pulls roughly 20 chunks, reranks them, and keeps the top seven. The ranking is hybrid, weighted 0.7 toward semantic similarity and 0.3 toward keyword match, so a precise scientific term is not lost to fuzzy embeddings. Each specialized agent receives a tight, relevant slice rather than an undifferentiated context window.
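
A minimal sketch of that hybrid rerank, assuming both scores are already normalized to the range 0 to 1. The 0.7 and 0.3 weights and the top-seven cutoff come from the case study. The `Chunk` type and the function names are hypothetical, and the case study does not say how PRINCE computes either score.

```python
from dataclasses import dataclass

SEMANTIC_WEIGHT = 0.7  # weight on semantic similarity (from the case study)
KEYWORD_WEIGHT = 0.3   # weight on keyword match (from the case study)
TOP_K = 7              # chunks kept after reranking (from the case study)


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieval candidate; scores assumed normalized to [0, 1]."""
    chunk_id: str
    semantic_score: float
    keyword_score: float


def hybrid_score(chunk: Chunk) -> float:
    """Weighted blend of the two relevance signals."""
    return SEMANTIC_WEIGHT * chunk.semantic_score + KEYWORD_WEIGHT * chunk.keyword_score


def rerank(candidates: list[Chunk], top_k: int = TOP_K) -> list[Chunk]:
    """Sort candidates by blended score, best first, and keep the top_k."""
    return sorted(candidates, key=hybrid_score, reverse=True)[:top_k]
```

With these weights, a chunk that matches the exact term (semantic 0.6, keyword 1.0) outranks a merely similar one (semantic 0.8, keyword 0.0). That is the effect the keyword weight is there to produce.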

The second is harness engineering: orchestration, state, and recovery. The Text-to-SQL path retries up to three times and caps results at 50 records. State lives in PostgreSQL, so when a run fails partway, it resumes from the failure point instead of restarting. The team runs daily evaluations against live traffic, which means regressions surface from real usage rather than a frozen test set.
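
The bounded retry and the resumable state can be sketched together. This is an illustration, not PRINCE's code. It reads "up to three times" as three attempts in total, and it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so the example runs anywhere. Every function and table name is hypothetical.

```python
import sqlite3
from typing import Callable, Optional

MAX_SQL_ATTEMPTS = 3  # read here as three attempts in total (assumption)
MAX_RECORDS = 50      # result cap from the case study


def run_with_retries(generate_sql: Callable[[Optional[str]], str],
                     execute: Callable[[str], list]) -> list:
    """Generate and run SQL within a fixed budget, feeding each error back."""
    last_error: Optional[str] = None
    for _ in range(MAX_SQL_ATTEMPTS):
        sql = generate_sql(last_error)
        try:
            return execute(sql)[:MAX_RECORDS]
        except Exception as exc:  # broad on purpose: any failure costs one attempt
            last_error = str(exc)
    raise RuntimeError(f"gave up after {MAX_SQL_ATTEMPTS} attempts: {last_error}")


def init_checkpoints(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that records completed steps."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT NOT NULL, step TEXT NOT NULL, PRIMARY KEY (run_id, step))"
    )
    conn.commit()


def run_pipeline(conn: sqlite3.Connection, run_id: str,
                 steps: list[tuple[str, Callable[[], None]]]) -> None:
    """Run steps in order, skipping any already recorded for this run_id."""
    for name, action in steps:
        done = conn.execute(
            "SELECT 1 FROM checkpoints WHERE run_id = ? AND step = ?",
            (run_id, name),
        ).fetchone()
        if done:
            continue
        action()  # if this raises, nothing is recorded and the step reruns later
        conn.execute(
            "INSERT INTO checkpoints (run_id, step) VALUES (?, ?)",
            (run_id, name),
        )
        conn.commit()
```

Calling `run_pipeline` again with the same `run_id` after a failure skips the completed steps and picks up at the one that failed. The retry budget is a constant, so a stuck generator ends in an error instead of an endless loop.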

None of that is model capability. It is plumbing. Reliable plumbing is what makes the model usable on data where a wrong answer has regulatory consequences.

The Control Is the Schema, Not the Prompt

Most teams that worry about an agent running a destructive query reach for the prompt. They write instructions: “you may only read data, never modify it.” That is a request. A capable model usually honors it. “Usually” is not a standard you can defend to an auditor.

PRINCE removes the request entirely. SQL schema validation is SELECT-only by construction. The agent cannot phrase a DELETE that the harness will execute, because the harness rejects anything that is not a read. The model’s behavior stops being the safety boundary. The boundary moves into code, where it is testable, reviewable, and identical on every run.
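
The case study does not publish the validator itself. What follows is a deliberately conservative sketch of the idea, under stated assumptions: it accepts a single statement that starts with SELECT and rejects anything containing a write or DDL keyword, at the cost of refusing some legitimate reads. The keyword list and all names are hypothetical. A production harness would more likely use a real SQL parser and pair it with a read-only database role.

```python
import re

# Hypothetical deny list. INTO is included because PostgreSQL's
# SELECT ... INTO creates a table, which is a write.
FORBIDDEN = {
    "INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE", "TRUNCATE",
    "GRANT", "REVOKE", "MERGE", "INTO", "COPY", "CALL", "EXECUTE",
}


class RejectedQuery(ValueError):
    """Raised when a query is not a plain, single SELECT."""


def validate_select_only(sql: str) -> str:
    """Return the cleaned statement if it is a lone SELECT, else raise."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        raise RejectedQuery("empty query")
    if ";" in statement:
        raise RejectedQuery("multiple statements are not allowed")
    if "--" in statement or "/*" in statement:
        raise RejectedQuery("comments are not allowed")
    if not re.match(r"SELECT\b", statement, re.IGNORECASE):
        raise RejectedQuery("only SELECT statements are allowed")
    # Whole-word scan, so a column such as updated_at does not trip UPDATE.
    words = set(re.findall(r"[A-Za-z_]+", statement.upper()))
    blocked = FORBIDDEN & words
    if blocked:
        raise RejectedQuery(f"forbidden keyword(s): {sorted(blocked)}")
    return statement
```

The check errs toward rejection: a string literal containing the word "delete" fails too. For a safety boundary that trade is usually acceptable, because a false rejection costs a retry and a false acceptance costs data.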

This is the same principle we traced across compute, data, knowledge, and identity in the agent containment stack. The data floor is not built by trusting the agent’s intentions. It is built by removing the agent’s ability to act outside its mandate. PRINCE is a production instance of that floor in a regulated setting, and the SQL constraint is among the cleanest illustrations of it we have seen published.

Verification Is Engineered In, Not Bolted On

Reading data safely is half the problem. The other half is trusting what the system says about that data. PRINCE handles trust with three reflection loops, used as checkpoints rather than decoration.

The process loop checks whether the agent followed the intended steps. The data loop checks whether the retrieved evidence actually supports the conclusion. The draft loop checks the generated answer before it reaches the user. Each loop is a place where the system can catch its own error before a human inherits it.
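
Structurally, the three loops behave like ordered gates. The sketch below shows only that shape. The checks are trivial placeholders, since the case study does not describe how each loop decides, and `Run`, `REQUIRED_STEPS`, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REQUIRED_STEPS = ["retrieve", "rerank", "answer"]  # hypothetical plan


@dataclass
class Run:
    """Hypothetical record of one agent run."""
    steps_taken: list[str]
    evidence: list[str]
    draft: str


def process_loop(run: Run) -> Optional[str]:
    """Placeholder: did the agent follow the intended steps?"""
    if run.steps_taken != REQUIRED_STEPS:
        return "steps deviated from the intended plan"
    return None


def data_loop(run: Run) -> Optional[str]:
    """Placeholder: is there any evidence behind the conclusion?"""
    if not run.evidence:
        return "no evidence retrieved to support a conclusion"
    return None


def draft_loop(run: Run) -> Optional[str]:
    """Placeholder: is there a draft worth showing the user?"""
    if not run.draft.strip():
        return "draft answer is empty"
    return None


CHECKPOINTS: list[tuple[str, Callable[[Run], Optional[str]]]] = [
    ("process", process_loop),
    ("data", data_loop),
    ("draft", draft_loop),
]


def first_failure(run: Run) -> Optional[tuple[str, str]]:
    """Return (loop name, problem) for the first failing gate, or None."""
    for name, check in CHECKPOINTS:
        problem = check(run)
        if problem is not None:
            return name, problem
    return None
```

Real checks would be far richer than these. The point of the shape is that each gate names where the run went wrong, so a failure can be retried or escalated at that stage.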

On top of the loops sits confidence scoring. When the named-entity recognition step produces a low-confidence extraction, the system flags it for human review instead of passing it through silently. The humans are not reviewing everything. They are reviewing the specific outputs the system already suspects.
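
Confidence routing is a small amount of code once the extraction step reports a score. A sketch, assuming a single threshold: the 0.8 value is invented for illustration, since the case study gives no number, and `Extraction` and `route` are hypothetical names.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # hypothetical; the case study does not state a value


@dataclass(frozen=True)
class Extraction:
    """Hypothetical output of a named-entity recognition step."""
    text: str
    label: str
    confidence: float


def route(extractions: list[Extraction],
          threshold: float = REVIEW_THRESHOLD
          ) -> tuple[list[Extraction], list[Extraction]]:
    """Split extractions into (accepted, flagged_for_human_review)."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    flagged = [e for e in extractions if e.confidence < threshold]
    return accepted, flagged
```

Only the `flagged` list goes to a reviewer, which is how the review load stays proportional to the system's own uncertainty.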

Every answer also carries granular citations, down to the source document and page. A reviewer does not have to trust the generated text. They can open the cited page and confirm it. In a preclinical safety context, that traceability is the difference between an interesting demo and a tool a scientist will sign their name under.
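
One way to make that traceability structural is to let the answer type refuse to exist without a citation. This is a sketch of the idea, not PRINCE's data model, and the type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical pointer to a source document and page."""
    document_id: str
    page: int

    def __post_init__(self) -> None:
        if self.page < 1:
            raise ValueError("page numbers start at 1")


@dataclass(frozen=True)
class Answer:
    """An answer that cannot be constructed without at least one citation."""
    text: str
    citations: tuple[Citation, ...]

    def __post_init__(self) -> None:
        if not self.citations:
            raise ValueError("an answer must cite at least one source page")

    def render(self) -> str:
        """Append citations in a compact [document, p. N] form."""
        refs = "; ".join(f"{c.document_id}, p. {c.page}" for c in self.citations)
        return f"{self.text} [{refs}]"
```

An uncited answer then fails at construction time, before it can reach a reviewer.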

This is the same coverage discipline we argued for in the review coverage deficit. Verification only counts when it is traceable to a source and routed by confidence, not when it is a uniform pass that treats every output as equally trustworthy.

What This Means for Your Build

The instinct in most agentic projects is to spend the budget on model selection and prompt tuning. PRINCE inverts that. The model is a component. The reliability comes from the harness around it, and from controls that are structural rather than advisory.

A caveat on the evidence. This is a Thoughtworks-authored account of a Thoughtworks-built system, so treat it as a vendor’s own account, not neutral third-party measurement. The case study describes the architecture in detail but discloses no accuracy or cost benchmarks. The mechanisms are concrete and worth copying. No performance numbers are on the table, so do not assume any.

What is copyable, today:

  • Move every irreversible action behind a structural block. If an agent should only read, make writes impossible in code, not discouraged in a prompt.
  • Cap and bound the loops. Three SQL retries and a 50-record result ceiling. Unbounded retries are how a stuck agent becomes an incident.
  • Persist state so failures resume instead of restart. PostgreSQL-backed checkpoints turn a crash into a pause.
  • Route review by confidence. Score the uncertain steps and send only those to a human. Uniform review wastes the reviewer and misses the real risk.
  • Cite to the source. Document and page, every time. Traceability is what lets a regulated user act on the output.

Do This Now

Open your highest-risk agent and find the most destructive action it can take. If the only thing stopping that action is an instruction in the prompt, you have a request where you need a wall. Move the control into the harness this week. Make the dangerous action impossible to express, then verify with a test that the harness rejects it. That single change does more for production reliability than any model upgrade on your roadmap.
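
A sketch of what such a test can look like. SQLite's read-only connection mode stands in here for whatever enforcement your own stack provides, whether that is a validator in the harness, a read-only database role, or both. The table and function names are hypothetical.

```python
import pathlib
import sqlite3
import tempfile


def make_demo_db(path: pathlib.Path) -> None:
    """Create a throwaway database with one row, using a writable connection."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE studies (id INTEGER PRIMARY KEY, compound TEXT)")
    conn.execute("INSERT INTO studies (compound) VALUES ('demo-compound')")
    conn.commit()
    conn.close()


def open_read_only(path: pathlib.Path) -> sqlite3.Connection:
    """Open the connection the agent would use: reads work, writes cannot."""
    return sqlite3.connect(f"{path.as_uri()}?mode=ro", uri=True)


def test_agent_connection_rejects_writes() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp).resolve() / "demo.db"
        make_demo_db(path)
        conn = open_read_only(path)
        try:
            # Reads still work through the restricted connection.
            assert conn.execute("SELECT COUNT(*) FROM studies").fetchone() == (1,)
            # The destructive statement must fail, not merely be discouraged.
            try:
                conn.execute("DELETE FROM studies")
            except sqlite3.OperationalError:
                pass
            else:
                raise AssertionError("write was not rejected")
        finally:
            conn.close()
```

Run it in CI. If someone later loosens the connection settings, the test fails before the agent ever gets the chance to.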


This analysis synthesizes the Building Reliable Agentic AI Systems case study (Thoughtworks and Bayer, June 2026), which documents the PRINCE platform’s context-engineering and harness-engineering architecture.

Victorino Group helps regulated organizations design agentic systems where the controls live in the harness, not the prompt. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
