Your AI Will Hallucinate. Build the System That Catches It.
A hotel chain asked its Retrieval-Augmented Generation (RAG) assistant how many properties had swimming pools. The system retrieved relevant documents, processed 300 hotel FAQ pages, and returned a confident answer: 47.
The actual number was 133.
The model did not misread a document. It did what language models do with aggregation queries: it generated a plausible-sounding number from text chunks instead of computing one. Traditional RAG retrieves context. It does not compute over it. The model treated a counting problem as a text-generation problem and produced a convincing fabrication.
This is not a bug that better prompts will fix. It is a structural property of how probabilistic systems process information.
The Industry Stopped Trying to Eliminate Hallucinations
The research consensus shifted in 2025. MetaRAG and multiple production studies confirmed what practitioners already suspected: hallucinations are intrinsic to probabilistic generation. You cannot train them out. You cannot prompt them away. You can only build systems that detect, validate, and control them.
This reframing changes who owns the problem. Model improvement is vendor territory. System governance is yours.
We mapped how hallucinations work at the circuit level: Claude’s default state is refusal, and hallucinations occur when false “known entity” features suppress that refusal circuit. The model does not randomly generate false information. It incorrectly classifies something as known, then speaks with full confidence. That mechanistic understanding is useful. But understanding the disease does not treat the patient. This article is about treatment.
Five Layers, Not One Fix
Production teams that have reduced hallucination impact share a common pattern. They do not rely on a single technique. They stack interventions at different points in the system, each catching what the others miss.
Layer 1: Constrained decoding prevents impossible outputs at token generation.
Tools like LMQL (ETH Zurich) and Outlines enforce structural constraints during generation, not after. This is fundamentally different from post-hoc JSON validation. When the model generates tokens, constrained decoding makes certain token sequences impossible. The model cannot produce an invalid date format if the decoder will not emit those tokens. Lower temperature (0.3) reduces hallucination rates, though the effect varies by model and task. This layer costs almost nothing in latency and eliminates an entire class of structural fabrication.
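The mechanism can be illustrated without any particular library. The sketch below is a toy greedy decoder over a hypothetical vocabulary: before each step, a regex mask removes every token that would break an ISO-date prefix, so structurally invalid output is unreachable no matter what the model prefers. The vocabulary, scoring function, and date constraint are illustrative assumptions, not the LMQL or Outlines API.

```python
import re

# Toy illustration of constrained decoding: instead of validating output
# after the fact, mask tokens that cannot extend a valid ISO date prefix.
# Vocabulary and the greedy "model" are hypothetical stand-ins.
VOCAB = ["2024", "2025", "-", "01", "07", "13", "31", "hello", "maybe"]
DATE_PREFIX = re.compile(r"^\d{4}(-(\d{2}(-(\d{2})?)?)?)?$")

def allowed_tokens(prefix: str) -> list[str]:
    """Return only the vocabulary tokens that keep the output a valid date prefix."""
    return [t for t in VOCAB if DATE_PREFIX.match(prefix + t)]

def decode(model_scores, max_steps: int = 5) -> str:
    """Greedy decode, but the mask makes invalid sequences impossible to emit."""
    out = ""
    for _ in range(max_steps):
        candidates = allowed_tokens(out)
        if not candidates:
            break
        # The model ranks tokens; the constraint filters them first.
        out += max(candidates, key=model_scores)
    return out

# A "model" that would love to say "hello" never gets the chance:
# the mask never offers that token.
result = decode(lambda t: {"hello": 9.0, "2025": 2.0, "07": 1.5}.get(t, 0.0))
print(result)  # → 2025-07-07
```

The key property is where the check runs: the constraint filters the candidate set before the model's preferences are consulted, which is why this layer adds almost no latency.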
Layer 2: Structured retrieval replaces text-chunk guessing with computation.
The hotel pool example came from testing Graph-RAG (Neo4j + Cypher queries) against traditional RAG on the same 300-document corpus. Where traditional RAG fabricated the pool count, Graph-RAG executed a Cypher query and returned 133. On out-of-domain queries (Antarctic hotel accommodations), RAG invented plausible answers. Graph-RAG returned nothing. The architecture is available as open source from AWS (aws-samples/sample-why-agents-fail on GitHub).
Returning nothing when you do not know is a feature. It is the feature most traditional RAG systems lack.
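The difference between the two behaviors can be sketched in a few lines. The hotel records and helper functions below are hypothetical stand-ins for a graph query (the AWS sample uses Neo4j with Cypher for the same idea, roughly `MATCH (h:Hotel)-[:HAS]->(:Amenity {name: $a}) RETURN count(h)`): the count is computed over structured data, and an out-of-domain query yields nothing rather than a plausible invention.

```python
# Structured retrieval sketch: the aggregation is computed, not generated.
# Records and helpers are illustrative stand-ins for a graph database query.
HOTELS = [
    {"name": "Harbor Inn", "amenities": {"pool", "gym"}},
    {"name": "City Lodge", "amenities": {"gym"}},
    {"name": "Lakeview",   "amenities": {"pool"}},
]

def count_with_amenity(hotels, amenity: str) -> int:
    """Exact count over structured data -- a computation, not a guess."""
    return sum(1 for h in hotels if amenity in h["amenities"])

def answer(hotels, amenity: str):
    """Return None when no records match instead of inventing a number."""
    n = count_with_amenity(hotels, amenity)
    return n if n > 0 else None

print(answer(HOTELS, "pool"))      # exact count from the data
print(answer(HOTELS, "heliport"))  # no fabrication: None
```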
Layer 3: Neurosymbolic guardrails create rules the model cannot override.
As we explored in the governance of AI output quality, system-level constraints beat prompts because prompts inform while hooks enforce. The BeforeToolCallEvent pattern (from AWS Strands Agents) intercepts tool calls at the framework level before execution. When the cancellation flag is set, the model receives a blocked response it cannot prompt-engineer around. In testing, unguarded agents executed three out of three invalid operations. Guarded agents blocked all three.
NVIDIA’s NeMo Guardrails and Guardrails AI provide open-source implementations of this pattern. NeMo uses Colang (a domain-specific language for defining conversational rails). Guardrails AI uses Pydantic-based validators with over 50 pre-built checks. Both operate at the system level, outside the model’s influence.
Layer 4: Semantic tool pre-filtering reduces the hallucination surface.
This one is counterintuitive. Hallucination rates increase as agents gain access to more tools. The model spends tokens reasoning about which tool to use, and that reasoning itself can hallucinate. Pre-filtering 31 available tools to 3 relevant ones via FAISS vector similarity (before the model sees them) produced an 86.4% error reduction and 89% token cost reduction across 29 travel planning queries. Fewer choices means fewer opportunities to fabricate reasoning about choices.
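The filtering step itself is simple ranking. Production systems embed tool descriptions with a learned model and index them in FAISS; the sketch below substitutes bag-of-words cosine similarity so it stays self-contained, and the tool catalog is invented for illustration. The shape is the same: score every tool against the query, hand the model only the top k.

```python
import math
from collections import Counter

# Hypothetical tool catalog; real descriptions would be embedded with a
# learned model and indexed in FAISS rather than bag-of-words vectors.
TOOLS = {
    "book_flight": "search and book airline flights between cities",
    "book_hotel": "find and reserve hotel rooms with amenities",
    "get_weather": "current weather forecast for a city",
    "convert_currency": "convert an amount between currencies",
    "send_email": "compose and send an email message",
}

def vec(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def prefilter(query: str, k: int = 3) -> list[str]:
    """Rank tools by similarity to the query; the model only ever sees the top k."""
    q = vec(query)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, vec(TOOLS[name])), reverse=True)
    return ranked[:k]

print(prefilter("reserve a hotel with a pool in Lisbon"))
```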
Layer 5: Specialized detection models catch what passed through.
Galileo’s Luna-2 is a 3B/8B parameter model purpose-built for hallucination detection. It achieves 0.95 F1 at sub-200ms latency for roughly $0.02 per million tokens. Compare that to using an LLM-as-judge, which costs 50x more and introduces its own hallucination risk (the judge can hallucinate about whether the subject hallucinated). Open-source alternatives like RAGAS and DeepEval provide evaluation frameworks at zero licensing cost.
The Latency Budget Is Real
These layers are not free. But the overhead is manageable when you route by risk.
Pydantic validation runs in under 5 milliseconds. Constrained decoding adds negligible overhead to generation. Guardrail checks land between 10 and 50 milliseconds. A full LLM-based verification pass costs 300 to 2,000 milliseconds.
Production teams report 50 to 100 milliseconds of total overhead as acceptable. The trick is risk-based routing: run fast structural checks on everything, invoke expensive verification only on high-stakes outputs. A customer service chatbot answering “what are your hours” does not need the same validation pipeline as one generating financial projections.
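Risk-based routing reduces to a small dispatch function: cheap structural checks run on every response, and the expensive verification pass fires only when the stakes warrant it. The tier names, check implementations, and the "unverified" heuristic below are illustrative assumptions, not a production policy.

```python
# Risk-based routing sketch: fast checks on everything, expensive
# verification only on high-stakes outputs. Check logic is illustrative.
def structural_check(text: str) -> bool:
    """Fast (sub-5 ms class): non-empty, no template placeholder leakage."""
    return bool(text.strip()) and "{{" not in text

def llm_verification(text: str) -> bool:
    """Slow (300-2000 ms class): stand-in for an LLM or detector pass."""
    return "unverified" not in text.lower()

def validate(text: str, risk: str) -> bool:
    if not structural_check(text):   # runs on every response
        return False
    if risk == "high":               # invoked only when stakes warrant it
        return llm_verification(text)
    return True

print(validate("We open at 9am.", risk="low"))
print(validate("Projected revenue: $4.2M (unverified)", risk="high"))
```

The "what are your hours" query pays only the fast path; the financial projection pays for the full pass.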
Self-Reported Confidence Is Unreliable
One popular mitigation strategy is asking the model how confident it is in its answer, then filtering low-confidence responses. Research from NAACL 2024 and a JMIR 2025 study found this approach is structurally flawed. Models are systematically overconfident on incorrect outputs. They do not know what they do not know (which, given the hallucination circuit mechanics, makes sense: the model has already classified the information as “known” before generating the response).
Token-level log probabilities or external evaluation models provide more reliable confidence signals than self-assessment. This is the difference between asking someone “are you sure?” and checking their work.
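Checking the work can be as simple as aggregating the token-level log probabilities the decoder already produced. The sketch below uses the weakest token as the signal, on the reasoning that a single very unlikely token is often where a fabrication entered; the aggregation choice (minimum rather than mean) and any threshold applied to it are illustrative, not a standard.

```python
import math

# Confidence from token-level log probabilities rather than self-report.
def sequence_confidence(token_logprobs: list[float]) -> float:
    """Convert the weakest token's log probability back to a probability.
    One near-zero-probability token flags the whole sequence."""
    return math.exp(min(token_logprobs))

confident = [-0.05, -0.10, -0.02]  # every token was a likely choice
shaky = [-0.05, -4.60, -0.02]      # one token the model nearly didn't pick

print(round(sequence_confidence(confident), 3))
print(round(sequence_confidence(shaky), 3))
```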
RAG Does Not Solve Hallucinations. It Shifts Them.
RAG is widely treated as a hallucination cure: grounding the model in retrieved documents should, in theory, prevent fabrication. In practice, RAG introduces a different failure mode: faithfulness hallucination. The model retrieves correct context and then ignores or misrepresents it in the response.
The hotel pool example is one class (aggregation failure). Faithfulness failure is another. The model has the right document in its context window, extracts a claim from it, subtly transforms that claim during generation, and presents the transformed version as fact. The source material is right there. The output is still wrong.
This is why “just add RAG” is insufficient as a governance position. RAG changes which types of hallucination you face. It does not eliminate hallucination as a risk category.
The Regulatory Clock Is Running
The EU AI Act reaches full enforcement in August 2026. Fines run up to EUR 35 million or 7% of global turnover. Stanford’s HAI AI Index 2025 tracked a 56.4% year-over-year increase in AI safety incidents (149 to 233). The regulatory environment is responding to this trajectory.
Organizations shipping AI systems without hallucination governance are accumulating compliance exposure every quarter. The question is not whether you will need these controls. It is whether you build them before or after an incident forces the conversation.
What a Hallucination Governance Architecture Looks Like
The pattern that works is defense-in-depth with risk-based routing.
Input validation catches malformed or adversarial queries before they reach the model. Constrained decoding and structured retrieval prevent certain hallucination classes during generation. Output guardrails (both rule-based and model-based) validate responses before they reach users. Monitoring and evaluation frameworks provide ongoing visibility into hallucination rates across production traffic.
No single layer is sufficient. Each catches a different failure mode. The combination creates a system where hallucinations that slip past one control encounter another.
As the framework governance layer thickens and agent architectures become more complex, these controls become infrastructure, not optional additions. They are the difference between an AI system and a production AI system.
The hotel chain needed to know how many properties had pools. The answer was 133. The fact that a language model confidently said 47 is not surprising. The fact that no system existed to catch the error before it reached a business decision is the actual failure. That is the failure hallucination governance prevents.
This analysis synthesizes AWS Samples’ Why Agents Fail research and open-source test suite (March 2026), Stanford HAI AI Index 2025 incident tracking, Galileo Luna-2 hallucination detection benchmarks (2025), NVIDIA NeMo Guardrails and Guardrails AI documentation, and ETH Zurich’s LMQL constrained decoding research.
Victorino Group helps organizations build hallucination governance into AI systems before unreliable outputs compound into trust failures. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Claude Opus model. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com. About The Thinking Wire →
If this resonates, let's talk
We help companies implement AI without losing control.
Schedule a Conversation