SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Psychology, Society, and Alignment · Training, RL, and Test-Time Scaling

Can LLM explanations actually help humans predict model behavior?

Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Do Models Explain Themselves?" introduces an evaluation framework for model explanations called counterfactual simulatability: can the explanation help a human predict what the model would do on related but different inputs? If a model answers "yes" to "Can eagles fly?" with the explanation "all birds can fly," then a human would infer that it also answers "yes" to "Can penguins fly?" If the model actually says "no," the explanation was imprecise: it gave the human a wrong mental model.

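Two metrics operationalize this: precision, the fraction of counterfactual inputs on which the answer a human infers from the explanation matches what the model actually outputs, and generality, the diversity of the counterfactual inputs that the explanation lets a human simulate.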
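As a rough sketch of how precision could be computed for the eagle/penguin example above: the `Counterfactual` record, the `simulation_precision` function, and the three toy questions are hypothetical names and data chosen for illustration, not the paper's implementation. A guess of `None` marks a counterfactual the explanation says nothing about, which is excluded from the denominator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Counterfactual:
    """One related input, the human's explanation-based guess, and the model's real answer."""
    question: str
    human_guess: Optional[str]  # None: the explanation does not let the human infer an answer
    model_answer: str


def simulation_precision(items: List[Counterfactual]) -> Optional[float]:
    """Fraction of simulatable counterfactuals where the human's guess matches the model."""
    simulatable = [c for c in items if c.human_guess is not None]
    if not simulatable:
        return None  # precision is undefined when nothing can be simulated
    matches = sum(c.human_guess == c.model_answer for c in simulatable)
    return matches / len(simulatable)


# Toy data (assumed): explanation given was "all birds can fly".
items = [
    Counterfactual("Can penguins fly?", "yes", "no"),   # guess is wrong
    Counterfactual("Can sparrows fly?", "yes", "yes"),  # guess is right
    Counterfactual("Can bats fly?", None, "yes"),       # not covered by the explanation
]

precision = simulation_precision(items)
print(precision)  # 0.5
```

In this toy case the explanation supports two guesses and one of them is wrong, so precision is 0.5 even though the explanation sounds reasonable for the original eagle question.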

The key finding: precision does not correlate with plausibility. Explanations that humans judge as factually correct and logically coherent do NOT enable accurate prediction of model behavior. This means RLHF, which optimizes for human approval of explanations, can be expected to improve plausibility (explanations that look good) without improving precision (explanations that predict behavior). The model learns to generate explanations humans like, not explanations humans can use.
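Whether two per-explanation scores move together can be checked with an ordinary correlation coefficient. The sketch below computes Pearson's r by hand. The function name `pearson_r` and the scores are toy values invented for illustration, not results from the paper, and the paper's own statistical analysis may differ.

```python
import math
from typing import Optional, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined when one list is constant
    return cov / math.sqrt(var_x * var_y)


# Toy scores (assumed), one pair per explanation:
# plausibility as a 1-5 human rating, precision as a fraction in [0, 1].
plausibility = [1, 2, 3, 4, 5]
precision = [0.6, 0.8, 0.7, 0.8, 0.6]

r = pearson_r(plausibility, precision)
print(round(r, 3))
```

A value of r near zero on real data would mean that knowing how good an explanation looks tells you little about how well it predicts behavior, which is the pattern the note describes.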

The second finding reinforces this: GPT-4 approximates human simulators with comparable inter-annotator agreement, and its agreement with humans is sometimes higher than human-human agreement. This supports using GPT-4 as a stand-in simulator when measuring precision, and it suggests that low precision is not an artifact of noisy measurement but a genuine property of the explanations.
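One way to compare model-human agreement against human-human agreement is a chance-corrected statistic such as Cohen's kappa. This is a minimal sketch under that assumption; the note does not say which agreement measure the paper uses, and the annotator labels below are toy values, not data from the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if each annotator labelled at random with their own label rates.
    expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy simulated answers (assumed) for four counterfactual questions.
human_1 = ["yes", "yes", "no", "no"]
human_2 = ["yes", "no", "no", "no"]
llm_sim = ["yes", "yes", "no", "no"]

human_human = cohens_kappa(human_1, human_2)
llm_human_1 = cohens_kappa(llm_sim, human_1)
llm_human_2 = cohens_kappa(llm_sim, human_2)
print(human_human, llm_human_1, llm_human_2)
```

If the model-human values are in the same range as the human-human value, the model simulator is behaving like one more annotator, which is the sense in which it can stand in for human simulators.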

The implication for the CoT-as-explanation paradigm is severe. The entire interpretability case for chain-of-thought rests on the assumption that reading the trace helps users understand how the model works. But if explanation precision is low, users build incorrect mental models from CoT. This bears directly on the question raised in the related note "Do chain-of-thought traces actually help users understand model reasoning?": optimizing for better-looking traces (via RLHF) will make the mental model problem worse, not better, because users will be more confident in less accurate predictions.

The satisfaction-vs-faithfulness mechanism makes the RLHF prediction concrete. This note argues that RLHF improves plausibility without improving precision; a user study ("Evaluating the False Trust Engendered by LLM Explanations") names the causal pathway. It draws on the finding that satisfaction (leaving the user feeling they understand the AI's reasoning) is a key property of explanations in human-AI interaction, and that RLHF-optimized models excel at producing helpful, warm, satisfying responses. Post-hoc explanations that argue for an answer's correctness therefore engender high false trust and hamper users' ability to distinguish correct from incorrect outputs, plausibly because the same RLHF optimization that drives sycophancy also drives explanations that please rather than predict. So "RLHF improves plausibility, not precision" is more than a missing correlation between two metrics; it is a behavioral consequence. Optimizing explanations for user satisfaction is what produces persuasive-but-uninformative explanations, the failure this note attributes to low counterfactual simulatability.


Source (enrichment): Flaws — "Evaluating the False Trust Engendered by LLM Explanations", https://arxiv.org/abs/2605.10930

Original note title

counterfactual simulatability of llm explanations is low and uncorrelated with plausibility — rlhf cannot fix explanation precision