How do mechanistic interpretability tools help distinguish truthfulness from honesty?
This explores how interpretability tools that look *inside* a model — at its internal representations, not just its words — can separate whether an AI says true things (truthfulness) from whether it says what it actually 'believes' (honesty).
Truthfulness (the output matches reality) and honesty (the output matches what the model internally represents) are two things we usually lump together. The corpus makes the surprising claim that they are *mechanistically distinct* properties. Using representation engineering (RepE), researchers find that a model can become more truthful while becoming less honest; the gap may widen in larger models, and current benchmarks cannot detect it (Can a model be truthful without actually being honest?). The reason you can't catch this from the outside is definitional: if you only score outputs against the world, a correct answer that contradicts the model's own internal representation looks identical to a sincere true answer. The distinction only becomes visible when you can compare the output against the model's *own* internal state.
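The definitional point can be made concrete with a minimal sketch. Everything here is illustrative: the names `Verdict`, `output_only_score`, and `classify` are hypothetical, answers are reduced to booleans, and the `internal` argument stands in for whatever an interpretability tool would read out of the model, which is the hard part this sketch does not attempt.

```python
from enum import Enum


class Verdict(Enum):
    TRUTHFUL_AND_HONEST = "truthful and honest"
    TRUTHFUL_NOT_HONEST = "truthful but not honest"
    HONEST_NOT_TRUTHFUL = "honest but mistaken"
    NEITHER = "neither truthful nor honest"


def output_only_score(stated: bool, world: bool) -> int:
    """Benchmark-style score: 1 if the stated answer matches reality."""
    return int(stated == world)


def classify(stated: bool, world: bool, internal: bool) -> Verdict:
    """Separate the two properties by also consulting the internal state.

    `internal` is an assumed readout of what the model represents;
    obtaining it is the job of the interpretability tooling.
    """
    truthful = stated == world      # output vs. reality
    honest = stated == internal     # output vs. internal representation
    if truthful and honest:
        return Verdict.TRUTHFUL_AND_HONEST
    if truthful:
        return Verdict.TRUTHFUL_NOT_HONEST
    if honest:
        return Verdict.HONEST_NOT_TRUTHFUL
    return Verdict.NEITHER


# Two cases that an output-only benchmark cannot tell apart.
sincere = dict(stated=True, world=True, internal=True)
insincere = dict(stated=True, world=True, internal=False)
```

Both example cases receive the same output-only score, yet `classify` assigns them different verdicts, which is the gap the paragraph above describes.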
That comparison is exactly what mechanistic interpretability is built to do, but only if you use the right tools. Representational analysis alone (finding a 'truth direction' in activation space) tells you a feature correlates with truthfulness, but not that the model *uses* it. Causal analysis alone shows that an intervention changes behavior without explaining why. You need the pair: locate a candidate feature representationally, then intervene on it and watch behavior change before claiming the mechanism is real (Can we understand LLM mechanisms with only representational analysis?). Honesty, in this framing, is a causal question: does the model's stated answer actually flow from its internal representation, or is something downstream overriding it? The toolkit borrowed from cognitive science maps cleanly onto this: Marr's levels let you ask what the model computes, how, and where in the network, turning 'is it lying?' into a layered, testable claim rather than a vibe (Can cognitive science methods unlock how LLMs actually work?).
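A toy sketch of why the representational step and the causal step answer different questions. This is not any particular paper's method: the 2-D "activations", the difference-of-means direction, and the two linear readouts are all assumptions chosen so the arithmetic is easy to follow. Real work would use hooked activations from an actual model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def truth_direction(true_acts, false_acts):
    """Representational step: unit difference-of-means direction."""
    diff = [t - f for t, f in zip(mean_vector(true_acts), mean_vector(false_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def ablate(act, direction):
    """Causal step: project the candidate direction out of an activation."""
    scale = dot(act, direction)
    return [a - scale * d for a, d in zip(act, direction)]


def readout(act, weights):
    """Stand-in for the downstream computation that produces the answer."""
    return dot(act, weights) > 0


def flip_rate(acts, direction, weights):
    """Fraction of inputs whose answer changes when the direction is ablated."""
    flips = sum(
        readout(a, weights) != readout(ablate(a, direction), weights) for a in acts
    )
    return flips / len(acts)


# Toy activations: the first coordinate separates true from false statements.
true_acts = [[2.0, 1.0], [1.5, -1.0]]
false_acts = [[-2.0, 1.0], [-1.5, -1.0]]
all_acts = true_acts + false_acts
direction = truth_direction(true_acts, false_acts)

uses_feature = [1.0, 0.0]     # hypothetical readout that relies on the direction
ignores_feature = [0.0, 1.0]  # hypothetical readout that never consults it

rate_used = flip_rate(all_acts, direction, uses_feature)
rate_ignored = flip_rate(all_acts, direction, ignores_feature)
```

The direction is recoverable from the activations in both scenarios, so a probe would report a 'truth feature' either way. Only the ablation distinguishes the readout that depends on it (answers flip) from the one that ignores it (nothing changes).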
What makes the truthfulness/honesty split believable is a broader finding about how models 'know' things at all. Understanding isn't monolithic: it comes in tiers (concepts as directions, facts as connections, principles as compact circuits), and crucially the higher tiers don't replace the lower heuristics; they sit on top of them as a patchwork (Do language models understand in fundamentally different ways?). A model can hold one internal representation while a shallow heuristic elsewhere produces a different surface answer. That is the structural gap where honesty can fail, and when the heuristic's answer happens to be the correct one, it fails even though truthfulness holds.
The corpus also shows why behavioral evidence keeps misleading us here. Models systematically over-trust their own generated answers because high-probability outputs simply *feel* correct during self-evaluation (Why do models trust their own generated answers?), and chain-of-thought can produce the *form* of reasoning without the substance: invalid reasoning steps score nearly as well as valid ones (Does logical validity actually drive chain-of-thought gains?). Both are cases where the visible output is a poor witness to the internal process, which is the exact problem interpretability tools exist to route around.
The takeaway you might not have expected: 'honesty' for an AI isn't a moral trait; it's a *measurable alignment* between what the model internally represents and what it says, and the field is starting to treat it as an engineering target. Reward designs that explicitly price abstention (scoring 'I don't know' above a hallucinated answer but below a correct one) push truthfulness up by making the model's stated confidence track something real rather than maximizing surface correctness (Can three-way rewards fix the accuracy versus abstention problem?). The frontier question the corpus leaves open is whether we can keep truthfulness and honesty rising *together*, instead of trading one for the other as models scale.
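The incentive behind a three-way reward can be worked through in a few lines. This is a sketch under stated assumptions, not the TruthRL implementation: the source notes give correct = +1 and hallucination = -1 but describe the abstention reward only as "intermediate", so the value 0.0 below is an assumption, as is the binary baseline of 1 for correct and 0 otherwise. The function names are hypothetical.

```python
def expected_guess_reward(p_correct, r_correct=1.0, r_wrong=-1.0):
    """Expected reward for answering when the model is right with prob. p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def should_abstain(p_correct, abstain_reward=0.0, r_correct=1.0, r_wrong=-1.0):
    """True when saying 'I don't know' beats guessing in expectation.

    abstain_reward=0.0 is an assumed midpoint; the source only says the
    abstention reward is intermediate between +1 and -1.
    """
    return abstain_reward > expected_guess_reward(p_correct, r_correct, r_wrong)


# Three-way reward: guessing at 30% confidence has expectation -0.4,
# so abstaining (0.0) is the better policy.
ternary_low_conf = should_abstain(0.3)

# Assumed binary baseline (wrong answers cost nothing): guessing never
# has negative expectation, so abstention is never strictly better.
binary_low_conf = should_abstain(0.3, r_wrong=0.0)
```

With these assumed values the break-even point is a 50% chance of being correct: below it the three-way reward favors abstaining, while the binary baseline rewards guessing at any confidence level.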
Sources (7 notes)
Research using RepE shows that truthfulness (output matches reality) and honesty (output matches internal representations) are separate mechanisms. Larger models may improve in truthfulness while declining in honesty, a gap current benchmarks cannot detect.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.