Does chain of thought reasoning actually explain model decisions?
When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
The explainability promise of CoT is that showing intermediate reasoning steps makes the model's decision-making process transparent and understandable. The "Thoughts without Thinking" paper tests this promise in an agentic pipeline implementing a perceptive task guidance system and finds that it fails in practice.
The empirical result: reviewer scores for CoT thoughts are only weakly correlated with reviewer scores for responses. Incorrect responses can follow plausible-looking chains, and flawed chains don't reliably predict or explain incorrect responses. The chain is not doing the causal work we assume it is.
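To make "weakly correlated" concrete, here is a minimal sketch of how paired reviewer scores could be compared with a Spearman rank correlation. The note does not say which statistic the paper used, so the choice of Spearman is an assumption, and the scores below are made-up placeholders, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical 1-5 reviewer scores for eight pipeline runs (illustrative only).
thought_scores = [5, 4, 5, 2, 4, 3, 5, 1]
response_scores = [2, 5, 1, 4, 3, 5, 4, 2]

rho = spearman(thought_scores, response_scores)
print(f"Spearman rho between thought and response scores: {rho:.2f}")
```

A value of rho near zero would mean that knowing how reviewers rated a chain tells you little about how they rated the response that followed it, which is the situation the note describes.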
Two failure modes identified through qualitative content analysis:
The Einstellung effect: CoT rapidly gravitates toward the tokens most commonly associated with a concept in training data, even when those tokens contradict the task requirements. In the dump truck assembly example, the chain starts reasoning about the toy but quickly pivots to "clutch," "transmission," and "gears," language far more common for real dump trucks than for toy assembly instructions. The chain explains what went wrong only in retrospect and only with considerable analytical effort.
Context window pressure: When the context window fills, the foundation model's parametric knowledge overrides the RAG-retrieved context. The chain reflects this substitution but doesn't flag it as a failure.
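Both failure modes show up as chain vocabulary that the retrieved context does not support. The sketch below is a rough diagnostic along those lines, not a method from the paper: it lists content words that appear in a chain but not in the retrieved text. The function names, the stopword list, and the two example strings are all hypothetical.

```python
import re

# Small illustrative stopword list (an assumption; a real check would use a fuller one).
STOPWORDS = frozenset({"then", "with", "onto", "this", "that", "from", "into", "before"})


def content_terms(text):
    """Lowercased alphabetic words longer than 3 letters, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def unsupported_terms(chain, retrieved_context):
    """Content words used in the chain that never occur in the retrieved context."""
    return sorted(content_terms(chain) - content_terms(retrieved_context))


# Made-up stand-ins for a retrieved instruction snippet and a generated chain.
retrieved_context = (
    "Attach the truck bed to the chassis with the two blue pegs, "
    "then snap the wheels onto the axles."
)
chain = (
    "The user is building the toy truck. Check the clutch and the "
    "transmission gears, then attach the truck bed."
)

flagged = unsupported_terms(chain, retrieved_context)
print(flagged)
```

On these example strings the flagged list includes "clutch", "transmission", and "gears", but it also includes harmless words the chain introduced on its own. A check like this only points a reviewer at places to look; deciding which terms reflect a real substitution still takes the interpretive effort the note describes.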
The deeper problem: CoT produces explanations without explainability. There is more material to analyze (the chain), but that material requires considerably more interpretive effort than a single output, and may actively mislead by appearing coherent. "Generating more material" ≠ "making the system more understandable."
This extends "Do language models actually use their reasoning steps?" from single-model settings to agentic pipelines, where the weak correlation has direct consequences for users trying to debug or trust systems. It also connects to "Do reasoning traces actually cause correct answers?": the human-like appearance of chains generates misplaced trust.
Inquiring lines that use this note as a source 11
This note is a source for these synthesized inquiries.
- What is the mechanistic signature when models chain facts never presented together?
- How often do papers treat chain-of-thought as interpretability incorrectly?
- Can chain-of-thought explanations be both sufficient and necessary for model decisions?
- Can chain of thought reasoning actually validate logical arguments?
- Why do chain-of-thought outputs look logical but perform rhetorically?
- Why do models rarely admit to their actual reasoning in chain-of-thought traces?
- Does chain of thought reasoning faithfully reflect what a model actually believes?
- Are chain-of-thought traces anthropomorphizing how AI models really reason?
- Can chain of thought monitoring reliably catch model misbehavior?
- Why do production agents depend more on their surrounding pipeline than the model?
- How do thought actions represent policy improvement steps in practice?
Related concepts in this collection 6
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  extends: faithfulness failure is measurable in production agentic systems, not just theoretically possible
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  agentic CoT multiplies the safety risk by adding inter-LLM trace generation
- Do chain-of-thought traces actually help users understand model reasoning?
  Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
  direct support: performance ≠ interpretability; agentic pipelines make this concrete
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  same pattern: representation (or trace) exists but doesn't causally determine output
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  the metric-level confirmation: weak thought-response correlation in agentic pipelines (this note) is the production-system manifestation of low counterfactual simulatability; users cannot predict model behavior from the explanations in either setting
- Can formal argumentation make AI decisions truly contestable?
  Explores whether structuring AI decisions as formal argument graphs (with explicit attacks and defenses) enables users to meaningfully challenge and navigate reasoning in ways unstructured LLM outputs cannot.
  a potential architectural remedy: formal argumentation forces reasoning into a traversable graph of attack/defense relations, making justification structure genuinely inspectable rather than producing CoT chains that appear coherent but lack causal connection to the output
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines
- Answering Questions by Meta-Reasoning over Multiple Chains of Thought
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
- Measuring Faithfulness in Chain-of-Thought Reasoning
- DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
- LLM Reasoning Is Latent, Not the Chain of Thought
- Chain of Thoughtlessness? An Analysis of CoT in Planning
Original note title
cot reasoning in agentic pipelines produces explanations without explainability because thought quality is weakly correlated with response quality