When Your AI Explains Its Reasoning, It's Making It Up
Ask Claude to add 36 + 59 and it will show you the carrying algorithm. Ones column: 6 plus 9 equals 15, carry the 1. Tens column: 3 plus 5 plus 1 equals 9. Answer: 95.
The explanation is correct. The math is right. And according to Anthropic’s own interpretability research, none of it reflects what actually happened inside the model.
Claude did not carry the 1. It ran two pathways in parallel: a rough estimate of the answer's magnitude and a precise computation of its final digit. The carrying explanation came after, generated to look plausible. The model computed the answer one way and explained it another.
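The gap between the two is easy to dramatize in code. The first function below is the textbook algorithm Claude narrates; the second is a toy caricature (my invention, not the actual circuit) of combining a coarse magnitude pathway with an exact last-digit pathway:

```python
def narrated_carry(a: int, b: int) -> int:
    """The column-by-column algorithm the explanation describes."""
    result, carry, place = 0, 0, 1
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        result += (s % 10) * place
        carry, a, b, place = s // 10, a // 10, b // 10, place * 10
    return result

def parallel_toy(a: int, b: int) -> int:
    """Toy caricature of parallel pathways: a rough magnitude estimate
    plus an exact ones digit, reconciled at the end.
    Illustrative only -- not how the real circuits work."""
    approx = round(a, -1) + round(b, -1)   # magnitude pathway: 40 + 60 = 100
    ones = (a % 10 + b % 10) % 10          # digit pathway: 5
    base = approx - approx % 10
    # pick the nearby number that ends in the right digit
    candidates = [base - 10 + ones, base + ones]
    return min(candidates, key=lambda c: abs(c - approx))

print(narrated_carry(36, 59), parallel_toy(36, 59))
```

Both return 95 for this input. That is the point: identical outputs can come from structurally unrelated computations, and the output alone cannot tell you which one ran.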
Philosopher Harry Frankfurt has a precise term for producing plausible derivations without actual underlying computation. He calls it bullshitting. Not lying (which requires knowing the truth and choosing to contradict it) but constructing narratives that sound right without concern for whether they reflect reality.
This distinction matters for anyone building governance around AI systems. Because a growing number of enterprises are treating model explanations as evidence of model reasoning. They are not.
What Interpretability Actually Found
ByteByteGo synthesized Anthropic’s interpretability research in March 2026, and the findings challenge assumptions that most AI governance frameworks take for granted.
Anthropic’s researchers used feature-decomposition techniques to replace opaque neurons with interpretable units, then built attribution graphs to trace computational pathways through the model. What they found is that Claude thinks in abstract, language-independent concepts. It does not think in English, or in any language. It processes meaning at a level that precedes language entirely.
The evidence: Claude 3.5 Haiku shares more than twice as large a proportion of its internal features across languages as smaller models do. A concept like “danger” or “addition” exists as a single internal representation regardless of whether the prompt arrives in English, Mandarin, or Portuguese. The model’s internal world is conceptual, not linguistic.
This matters because the explanations models produce are linguistic. They are translations from a conceptual space into words. And like all translations, they are approximate. The model is not reporting its computation. It is generating a plausible account of how a human might have arrived at the same answer.
The Hallucination Circuit
The interpretability research also mapped how hallucinations happen at the mechanistic level. This is worth understanding in detail because it contradicts the common assumption that hallucinations are random failures.
Claude’s default state is refusal. The model starts from a position of “I should not make claims about this” and only overrides that default when sufficient evidence accumulates. Hallucinations occur through a specific mechanism: false “known entity” features activate and suppress the refusal circuit. The model does not randomly generate false information. It incorrectly classifies something as known, which turns off the safety check that would have prevented the false claim.
This is a designed behavior misfiring, not chaos. The model has a refusal system. The refusal system works. Hallucinations happen when a different system (entity recognition) incorrectly tells the refusal system to stand down.
For governance purposes, this finding is both reassuring and alarming. Reassuring because the failure mode is specific and potentially addressable. Alarming because it means hallucinations arrive with the same confidence as accurate responses. The model does not hedge when it hallucinates. It has already classified the information as “known.”
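The mechanism reads like a gating circuit, and a caricature in code makes the failure mode vivid. Everything here (the score, the threshold, the return strings) is invented for illustration; the real model works with learned feature activations, not Booleans:

```python
from typing import Optional

def answer_policy(known_entity_score: float, evidence: Optional[str],
                  threshold: float = 0.8) -> str:
    """Toy caricature of the refusal/override gating. All names,
    numbers, and strings are illustrative, not Claude's internals."""
    # Default state: refuse to make claims.
    if known_entity_score < threshold:
        return "refuse: I don't have reliable information about that."
    # The 'known entity' signal fired, so the refusal stands down,
    # whether or not any evidence actually backs the answer.
    if evidence is not None:
        return f"answer from: {evidence}"
    return "confident answer with no basis"  # the hallucination path

# A false 'known entity' activation (high score, no evidence)
# produces the hallucination with full confidence:
print(answer_policy(0.95, None))
```

Note what the toy preserves from the research: the hallucinating path and the accurate path exit through the same confident branch. There is no hedged output, because the refusal was already suppressed.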
Why This Breaks Chain-of-Thought Governance
Several enterprise AI governance frameworks treat chain-of-thought reasoning as an audit trail. The logic goes: if the model shows its work, we can verify its reasoning, catch errors, and build trust through transparency.
The interpretability research dismantles this logic at the foundation.
Chain-of-thought output is not the model’s reasoning made visible. It is the model’s post-hoc construction of a plausible reasoning narrative. The actual computation runs through parallel pathways, magnitude estimations, and feature activations that bear no structural resemblance to the step-by-step explanation the model produces.
In The AI Verification Debt, we argued that verification is governance. The interpretability findings show precisely why. Model explanations cannot substitute for external verification because the explanations are not descriptions of what happened. They are stories about what might have happened, generated by the same system that produced the original output.
Consider what this means in practice. A financial model explains its risk assessment. A medical AI explains its diagnosis. A legal AI explains its contract analysis. In each case, the explanation looks like reasoning. It follows logical steps. It references relevant factors. And none of it necessarily reflects the actual computational process that produced the conclusion.
The conclusion might be correct. The explanation might even be useful as a teaching tool. But treating it as evidence of how the model arrived at its answer is a category error.
The Scalability Problem
Even if interpretability techniques could reliably reveal a model’s true computational process, the current state of the science cannot deliver this at governance scale.
Anthropic’s attribution graph analysis succeeds on roughly a quarter of test prompts. Each successful analysis requires hours of human effort. The technique works on simple, well-defined computations and struggles with complex, multi-step reasoning.
Twenty-five percent coverage with hours of manual analysis per case is a research achievement. It is not a governance tool. Production AI systems process millions of queries. A verification method that works on one in four prompts and requires a researcher-day per analysis covers approximately zero percent of production traffic.
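The back-of-envelope arithmetic bears this out. The one-million-queries volume and the eight-hour analysis cost are assumptions chosen for illustration; the 25 percent success rate is the only figure from the research:

```python
queries_per_day = 1_000_000   # assumed production volume
success_rate = 0.25           # attribution graphs succeed on ~1 in 4 prompts
hours_per_case = 8            # assumed cost of one manual analysis
researcher_hours = 8          # one researcher, one working day

# Successful analyses one researcher completes per day:
analyses = (researcher_hours / hours_per_case) * success_rate  # 0.25
coverage = analyses / queries_per_day
print(f"coverage: {coverage:.6%} of daily traffic")
```

One researcher-day yields a quarter of one completed analysis against a million queries, which is what "approximately zero percent of production traffic" means in numbers.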
This limitation is not a criticism of the research. It is a statement about the distance between interpretability science and interpretability governance. The science is advancing. The distance remains large.
What This Means for Verification
The interpretability findings strengthen, rather than weaken, the case for verification infrastructure. They just change the argument.
The old argument: we need verification because AI might be wrong. The new argument: we need verification because we cannot trust AI’s own account of why it is right.
As we documented in The Verification Tax, organizations already spend nearly as much time checking AI output as they save generating it. The interpretability research explains part of why that tax persists. Workers intuitively distrust model explanations. The research validates that instinct.
And in Half Your Benchmarks Are Wrong, we showed that automated evaluation inflates AI performance scores by 24 percentage points compared to human expert judgment. The interpretability findings add a new layer: even the model’s self-reported reasoning is unreliable as an evaluation input.
The practical implication is clear. Governance frameworks must treat AI output and AI explanations as two separate artifacts, both requiring independent verification. The output needs to be checked against ground truth. The explanation needs to be treated as what it is: a plausible narrative, useful for communication, unreliable as evidence.
Building Governance That Does Not Depend on Model Honesty
The word “honesty” is imprecise here, but instructive. We do not expect calculators to explain their reasoning. We expect them to produce correct outputs, and we verify those outputs against known standards. Nobody asks a calculator why it thinks 36 plus 59 equals 95.
AI systems are different because they operate in domains where “correct” is ambiguous and context-dependent. This is exactly why the temptation to rely on explanations is strong. When you cannot easily verify the output, the explanation becomes a proxy for correctness. The interpretability research shows that proxy is unreliable.
Governance that does not depend on model explanations looks different. It looks like output verification against independent data sources. It looks like consistency checks across multiple model runs. It looks like human expert review of conclusions, not explanations. It looks like treating every model output as an untrusted input until corroborated.
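The consistency check, for one, is cheap to implement. In the sketch below, `ask_model` is a placeholder for whatever model API you actually call; the sampling and agreement logic is the point:

```python
from collections import Counter
from typing import Callable, Dict

def consistency_check(ask_model: Callable[[str], str], prompt: str,
                      runs: int = 5, min_agreement: float = 0.8) -> Dict:
    """Sample the same prompt several times and flag unstable answers.

    `ask_model` stands in for a real model call. In practice,
    normalize outputs (strip, lowercase, canonicalize) before counting.
    """
    answers = [ask_model(prompt) for _ in range(runs)]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / runs
    return {
        "answer": top,
        "agreement": agreement,
        # Agreement is necessary, not sufficient: a consistently
        # wrong answer still needs a ground-truth check.
        "stable": agreement >= min_agreement,
    }

# With a fake deterministic model for demonstration:
result = consistency_check(lambda p: "95", "36 + 59 = ?")
```

Consistency across runs catches instability, not inaccuracy; it belongs alongside the output verification and expert review described above, not in place of them.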
This is more expensive than reading a chain-of-thought trace and deciding it looks reasonable. It is also more honest about what we actually know about how these systems work.
The interpretability research is valuable precisely because it narrows our ignorance. We now know, with mechanistic evidence, that model explanations are post-hoc rationalizations. We know that hallucinations arise from specific circuit failures, not random noise. We know that models think in concepts, not language.
Each finding makes the systems slightly less opaque. None of them makes model explanations trustworthy as governance evidence. The path forward is verification infrastructure that assumes explanations are unreliable and builds accountability anyway.
This analysis synthesizes How Anthropic’s Claude Thinks (March 2026), based on Anthropic’s published interpretability research.
Victorino Group helps enterprises build verification infrastructure for AI systems. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.