Do chain-of-thought traces actually help users understand model reasoning?
Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
A common assumption behind CoT traces is that they serve as explanations: the model shows its work, users follow the reasoning, and trust is established. The study below finds this assumption wrong in a specific, measurable way.
Empirical findings from a 100-participant human-subject study:
- R1 traces: highest final solution accuracy, lowest human interpretability ratings
- Algorithmically-generated semantically correct traces: lowest performance despite being verifiably correct
- LLM-generated summaries of R1 traces: better interpretability, intermediate performance
The traces that are most useful for the model to generate correct answers are least useful for humans trying to understand those answers. The two objectives pull in opposite directions.
The mechanism: CoT traces used for SFT are optimized to be a training signal — to push the model toward correct token sequences through backpropagation. The properties that make a trace useful for training (complex recursive structure, non-linear exploration, self-doubt and revision cycles) are exactly the properties that make it cognitively opaque to humans.
This has a design implication that some systems are already acting on: GPT-OSS models generate a CoT trace (for model performance), a summary (for human communication), and a final answer. The trace is not shown to users. This separation acknowledges the decoupling.
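The separation can be sketched as a small data structure. This is a minimal illustration, not the GPT-OSS output format: the `ReasoningOutput` container, its field names, and the `render_for_user` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningOutput:
    """Hypothetical container for the three artifacts described above."""

    trace: str    # raw chain-of-thought, generated for model performance
    summary: str  # short account of the reasoning, written for humans
    answer: str   # final answer


def render_for_user(output: ReasoningOutput) -> str:
    """Build the user-facing text from the summary and answer only.

    The raw trace is deliberately left out of the returned string.
    """
    return f"{output.summary}\n\nAnswer: {output.answer}"


# Illustrative values only.
example = ReasoningOutput(
    trace="try x=3... no, recheck the sign... x=4 works",
    summary="Solved the equation by substitution.",
    answer="x = 4",
)
user_view = render_for_user(example)
```

Keeping the trace as a separate field, rather than concatenating everything into one string, means the decision about what the user sees is made in one place and the trace can still be logged or audited separately.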
The implication for AI transparency: showing users CoT traces is not showing them how the model reasons. It is showing them the model's training scaffold. What users need is a summary; what models need is the trace. Conflating the two in the name of "explainability" produces outputs that feel transparent without providing genuine interpretability.
This is a distinct claim from "Do reasoning traces actually cause correct answers?". That note warns against inferring intentional reasoning from traces. This note adds that even if you don't anthropomorphize, the traces are the wrong artifact for human interpretability. Reading traces as reasoning and reading them as explanations are both mistakes, for different reasons.
Controlled user-study evidence: traces don't just fail to help; they actively mislead. The interpretability-rating gap documented above measures how understandable traces feel. A between-subjects user study ("Evaluating the False Trust Engendered by LLM Explanations") measures whether they improve judgment, and finds the stronger result: showing users reasoning traces or post-hoc explanations raises their acceptance of the model's answer regardless of whether the answer is correct. The explanations are persuasive but not informative. This sharpens the decoupling claim from "traces serve the model, not the user" to "traces given to the user degrade their ability to detect errors." The only explanation format that restored discrimination in that study was a contrastive dual explanation arguing both sides (see "Do explanations actually help users spot AI mistakes?"). In other words, the fix is not a better one-sided trace but an artifact that argues against the model's own output.
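The difference between "more accepted" and "better discriminated" can be made concrete. The sketch below assumes one simple way to score it: acceptance rate on correct answers minus acceptance rate on incorrect answers. The study may use a different measure, and the `Judgment` type, function names, and toy records are invented for illustration; they are not data from the study.

```python
from typing import NamedTuple, Sequence


class Judgment(NamedTuple):
    model_correct: bool  # was the model's answer actually correct?
    user_accepted: bool  # did the participant accept that answer?


def acceptance_rate(judgments: Sequence[Judgment], model_correct: bool) -> float:
    """Share of answers accepted, among answers with the given correctness."""
    subset = [j for j in judgments if j.model_correct == model_correct]
    if not subset:
        raise ValueError("no judgments with the requested correctness")
    return sum(j.user_accepted for j in subset) / len(subset)


def discrimination(judgments: Sequence[Judgment]) -> float:
    """Acceptance of correct answers minus acceptance of incorrect ones."""
    return acceptance_rate(judgments, True) - acceptance_rate(judgments, False)


# Toy records invented for illustration; NOT data from the study.
no_explanation = [
    Judgment(True, True), Judgment(True, True),
    Judgment(True, True), Judgment(True, False),
    Judgment(False, True), Judgment(False, False),
    Judgment(False, False), Judgment(False, False),
]
with_trace = [
    Judgment(True, True), Judgment(True, True),
    Judgment(True, True), Judgment(True, True),
    Judgment(False, True), Judgment(False, True),
    Judgment(False, True), Judgment(False, False),
]

for name, records in [("no explanation", no_explanation), ("with trace", with_trace)]:
    print(name, acceptance_rate(records, True), acceptance_rate(records, False),
          discrimination(records))
```

In these toy records the trace condition has higher acceptance for both correct and incorrect answers, yet a smaller gap between the two. That is the pattern the paragraph describes: persuasion rises while the ability to tell right from wrong falls.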
Source (enrichment): Flaws — "Evaluating the False Trust Engendered by LLM Explanations", https://arxiv.org/abs/2605.10930
Related concepts in this collection 7
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  traces are not verified reasoning AND are not human-interpretable; two separate failures
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  causal faithfulness and user interpretability are both absent; neither is guaranteed by the presence of a trace
- Why do models trust their own generated answers?
  Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.
  models can't evaluate their own reasoning; neither can users from raw traces
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  explains why the decoupling exists: if CoT is constrained imitation of reasoning patterns from training data, traces are optimized to continue familiar token sequences (model performance) not to explain the reasoning process to humans (interpretability)
- Does fine-tuning disconnect reasoning steps from final answers?
  When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
  fine-tuning exacerbates both the faithfulness and interpretability dimensions: if traces are already decoupled from user interpretability (this note), and fine-tuning further decouples reasoning steps from final answers (faithfulness degradation), then post-fine-tuning traces serve neither the model nor the user
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  the SFT accuracy trap creates the conditions for the performance-interpretability decoupling: accuracy optimization selects for traces that drive correct outputs rather than traces that explain reasoning, directly producing the divergence documented here
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  provides the metric-level evidence for this architectural decoupling: explanation precision (can users predict model behavior from explanations?) is uncorrelated with plausibility (do explanations look good?), confirming that RLHF-style optimization improves appearance without improving functional utility
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
Original note title
cot traces optimize model performance, not user interpretability — the two objectives are decoupled