Deep Dive · Concept Engineering

LLMs Know More Than They Show: Detecting Hallucinations from Inside the Model

ICLR 2025 research reveals that LLMs internally encode correct answers while generating incorrect ones. New methods exploit this discrepancy for real-time, query-specific truthfulness correction.

By OrdoResearch
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Here is a disturbing finding: when a language model generates a factually incorrect answer, it often knows the correct answer internally. The discrepancy between what the model represents and what it outputs suggests that hallucination is not simply a failure of knowledge but a failure of expression — and that the tools to detect and correct these failures may already exist inside the model itself.

The Internal Truth Signal

Orgad, Toker, Gekhman et al. (2024), in a paper accepted at ICLR 2025, present a systematic investigation of how LLMs internally encode truthfulness information. Their key discovery is that LLM representations contain substantially more information about the truthfulness of their outputs than previously recognized — but this information is concentrated in specific tokens and varies across different types of tasks.

The first finding concerns token selection. Previous work typically probed the final token of the prompt or the first generated token to detect truthfulness signals. Orgad et al. show that truthfulness information is concentrated in the exact answer tokens — for example, in the token "Hartford" within the generation "The capital of Connecticut is Hartford." Probing the correct tokens rather than arbitrary ones dramatically improves error detection performance.
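To make the token-selection point concrete, here is a minimal sketch on synthetic data (not the paper's setup): per-token "hidden states" where only the answer-token position carries a truthfulness signal. A linear probe trained at that position detects errors far better than one trained at the final token. All dimensions, positions, and signal strengths are illustrative assumptions.

```python
# Sketch of exact-answer-token probing on synthetic hidden states.
# Assumption: only position 5 (the "answer token") carries the signal.
import numpy as np

rng = np.random.default_rng(0)

def make_example(truthful: bool, d: int = 16, seq_len: int = 8):
    """Synthetic sequence of hidden states; only the answer token
    (position 5 here, by construction) encodes truthfulness."""
    h = rng.normal(size=(seq_len, d))
    signal = 1.0 if truthful else -1.0
    h[5, 0] += 2.0 * signal  # inject the truthfulness signal at the answer token
    return h

def probe_accuracy(position: int, n: int = 400) -> float:
    """Train/test a least-squares linear probe on hidden states at `position`."""
    X = np.stack([make_example(i % 2 == 0)[position] for i in range(n)])
    y = np.array([1.0 if i % 2 == 0 else -1.0 for i in range(n)])
    Xtr, ytr, Xte, yte = X[: n // 2], y[: n // 2], X[n // 2 :], y[n // 2 :]
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return float(np.mean(np.sign(Xte @ w) == yte))

acc_answer = probe_accuracy(position=5)  # probe the exact answer token
acc_last = probe_accuracy(position=7)    # probe the final token, as in prior work
print(f"answer-token probe: {acc_answer:.2f}, last-token probe: {acc_last:.2f}")
```

The gap between the two accuracies illustrates why probing the exact answer tokens, rather than a fixed position, matters.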

The second finding challenges a popular assumption. Several prior studies suggested the existence of a universal "truthfulness direction" in LLM activation space — a single vector that separates truthful from untruthful representations across all tasks. Orgad et al. demonstrate that this universality does not hold. Truthfulness encoding is skill-specific: a probing classifier trained to detect errors in factual retrieval tasks does not generalize to sentiment analysis or reasoning tasks. LLMs encode multiple, distinct notions of truth rather than a single unified concept.

The third and most striking finding reveals a fundamental misalignment between internal encoding and external behavior. In a significant fraction of cases, the model's internal representations encode the correct answer while the model consistently generates an incorrect one. This discrepancy suggests that hallucination reduction might be achievable by better aligning output generation with already-existing internal knowledge, rather than injecting new knowledge from external sources.

From Universal Vectors to Query-Specific Correction

If truthfulness encoding is not universal but varies across queries, then applying a single correction vector to all inputs is inherently limited. Wang, Cao, Cao, and Chen (2025) address this directly with TruthFlow, a method that generates query-specific truthful representation corrections using Flow Matching — a technique that learns smooth transformations between probability distributions.

TruthFlow first trains a flow model to learn correction vectors that transition LLM representations from hallucinated states to truthful states. During inference, the trained model takes any specific query's representations as input and generates a tailored correction vector. This drops the assumption, embedded in prior methods such as Inference-Time Intervention (ITI), that a single adjustment can fix all hallucinations.
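The advantage of query-specific over universal corrections can be shown with a heavily simplified stand-in (not TruthFlow's actual flow-matching objective): fit a linear map from a representation to its needed correction, and compare against a single averaged shift. The query-dependent structure here is a constructed assumption.

```python
# Simplified stand-in for query-specific correction: learn correction(h) as a
# function of h, versus adding one universal vector to every query.
import numpy as np

rng = np.random.default_rng(2)
d = 8

A = rng.normal(size=(d, d)) * 0.3  # query-dependent shift structure (assumed)
H = rng.normal(size=(500, d))      # "hallucinated" representations
T = H + H @ A.T                    # their "truthful" counterparts

# One ridge-free least-squares fit of correction(h) = W h, standing in for
# the learned flow/velocity field that maps each query to its own correction.
W, *_ = np.linalg.lstsq(H, T - H, rcond=None)

h_new = rng.normal(size=d)                    # unseen query representation
target = h_new + h_new @ A.T                  # its true truthful state
err_query_specific = float(np.linalg.norm(h_new + h_new @ W - target))

# Baseline: one universal correction vector (mean shift), ITI-style.
mean_shift = (T - H).mean(axis=0)
err_universal = float(np.linalg.norm(h_new + mean_shift - target))
print(f"query-specific error: {err_query_specific:.2e}, "
      f"universal-vector error: {err_universal:.2e}")
```

Because the required correction varies with the query, the averaged vector cannot fit any individual query well, while the query-conditioned map can.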

The method includes a truth-related subspace projection step that filters noise from query representations before applying corrections. Experiments on TruthfulQA demonstrate that TruthFlow enhances truthfulness across multiple LLMs, particularly for open-ended generation tasks where existing methods struggle. The trained model also exhibits strong transferability, performing effectively on hallucination benchmarks it was not trained on.
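One plausible reading of a truth-related subspace projection, sketched here under the assumption that truthful-minus-hallucinated representation differences concentrate in a low-dimensional subspace: estimate that subspace with an SVD and project query representations onto it before correcting.

```python
# Sketch of a truth-related subspace projection (the subspace construction
# and dimensions here are illustrative assumptions, not the paper's recipe).
import numpy as np

rng = np.random.default_rng(3)
d = 16

# Ground-truth 2-D subspace where truthful-minus-hallucinated diffs live.
basis_true = np.linalg.qr(rng.normal(size=(d, 2)))[0]
diffs = rng.normal(size=(200, 2)) @ basis_true.T + 0.1 * rng.normal(size=(200, d))

# Estimate the subspace from the diffs via SVD and build a projector P = U U^T.
U = np.linalg.svd(diffs, full_matrices=False)[2][:2].T  # top-2 right singular vecs
P = U @ U.T

q = rng.normal(size=d)   # a noisy query representation
q_filtered = P @ q       # keep only the truth-related directions
print("norm before:", np.linalg.norm(q), "after:", np.linalg.norm(q_filtered))
```

Projecting onto the estimated subspace discards the components of the query representation orthogonal to the truth-related directions, which is one way "filtering noise" before correction could work.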

The Emerging Architecture of Truthfulness Control

These works, together with parallel research on SAE-based knowledge selection steering by Zhao, Devoto, and Hong (2024), reveal an emerging architecture for real-time truthfulness control in LLMs:

Detection layer. Probe internal representations at the exact answer tokens to identify when the model is about to generate content that contradicts its own internal knowledge. The skill-specific nature of truthfulness encoding means that different probes may be needed for different task types.

Correction layer. Apply query-specific representation adjustments — not universal vectors — to shift the model's output distribution toward its internally encoded truthful response. Flow-based methods provide a principled framework for learning these corrections.

Verification layer. Monitor whether the correction successfully aligned external output with internal representation, using the same probing techniques that detected the original discrepancy.
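The three layers above can be wired together in a minimal sketch. Everything here is hypothetical: the probe direction, threshold, and fixed correction strength stand in for trained components (TruthFlow would generate a query-specific correction rather than a fixed shift).

```python
# End-to-end sketch of detect -> correct -> verify on a single representation.
import numpy as np

truth_dir = np.array([1.0, 0.0, 0.0, 0.0])  # assumed known probe direction

def detect(h, threshold: float = 0.0) -> bool:
    """Detection layer: probe score below threshold flags a likely hallucination."""
    return float(h @ truth_dir) < threshold

def correct(h, strength: float = 2.0):
    """Correction layer: shift along the truth direction (a fixed shift here;
    a query-specific method would tailor this vector per input)."""
    return h + strength * truth_dir

def verify(h) -> bool:
    """Verification layer: re-run the same probe after correction."""
    return not detect(h)

h = np.array([-0.8, 0.3, 0.1, -0.2])  # representation probing as untruthful
if detect(h):
    h = correct(h)
print("verified truthful after correction:", verify(h))
```

Note that verification reuses the detection probe, which is exactly the reuse the text describes: the signal that exposed the discrepancy also certifies its repair.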

Why This Matters

The practical implications are immediate. If LLMs frequently "know" the correct answer but generate incorrect ones, then the hallucination problem is more tractable than it appears from the outside. Rather than requiring models to learn new information, we may only need to improve the fidelity of the path from internal representation to output.

For high-stakes applications — medical diagnosis, legal analysis, financial advice — the ability to detect hallucinations in real time using the model's own internal signals, without relying on external fact-checking databases, would represent a significant advance in trustworthy AI.

Open Questions

The finding that truthfulness encoding is skill-specific raises questions about how many distinct "truth detectors" would be needed for comprehensive coverage. Is there a practical taxonomy of truthfulness types, or does the number of required probes scale with the diversity of tasks?

The internal-external misalignment finding is equally provocative. Why does the model generate incorrect answers when it internally encodes correct ones? Understanding this mechanism — whether it relates to output distribution biases, attention pattern failures, or decoding strategy limitations — could unlock more targeted interventions than representation-level corrections.

Looking Forward

The convergence of internal truthfulness probing, query-specific correction, and SAE-based knowledge steering points toward a future where LLMs are equipped with built-in truthfulness monitoring — not as an external guardrail but as an integral part of the generation process. The tools for detecting and correcting hallucinations may already be encoded in the very representations that produce them.


References

Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., & Belinkov, Y. (2024). LLMs know more than they show: On the intrinsic representation of LLM hallucinations. International Conference on Learning Representations (ICLR) 2025. arXiv:2410.02707.

Wang, H., Cao, B., Cao, Y., & Chen, J. (2025). TruthFlow: Truthful LLM generation via representation flow correction. arXiv preprint arXiv:2502.04556.

