Do language model reasoning drafts faithfully represent their actual computation?
If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
The promise of thinking models for AI safety monitoring is specific: because the model externalizes its reasoning in a thinking draft before answering, observers can read the draft to detect errors and control what happens in the answer stage. This promise depends on one empirical assumption: that the thinking draft faithfully represents the model's actual internal computation. This paper tests that assumption with counterfactual interventions and finds it frequently violated.
Intra-Draft Faithfulness: When a false or contradictory step is inserted mid-draft, do subsequent steps and the final draft conclusion appropriately integrate or correct it? If the draft is faithful, inserted errors should produce systematic downstream effects. Finding: large reasoning models (LRMs) are only selectively faithful; some steps matter, most don't. Counterfactual integration is inconsistent across models and tasks.
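The intra-draft test can be illustrated with a minimal sketch. It is not the paper's implementation: `Continuation`, `intervene`, `classify`, `faithful_rate`, and `toy_model` are hypothetical names introduced here, the model is a hard-coded stand-in, and "conclusion changed" is used as a crude proxy for "integrated the inserted step".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Continuation:
    steps: List[str]   # steps generated after the inserted one
    conclusion: str    # conclusion stated at the end of the draft


def intervene(steps: List[str], position: int, injected: str) -> List[str]:
    """Keep the first `position` original steps, then append the injected step."""
    if not 0 <= position <= len(steps):
        raise ValueError("position must lie within the draft")
    return steps[:position] + [injected]


def classify(original_conclusion: str,
             continuation: Continuation,
             flags_error: Callable[[str], bool]) -> str:
    """Label one trial as 'corrected', 'integrated', or 'ignored'."""
    if any(flags_error(step) for step in continuation.steps):
        return "corrected"    # a later step explicitly rejects the inserted one
    if continuation.conclusion != original_conclusion:
        return "integrated"   # crude proxy: the conclusion moved after the insertion
    return "ignored"          # same conclusion, no acknowledgement of the insertion


def faithful_rate(labels: List[str]) -> float:
    """Share of trials where the inserted step was corrected or integrated."""
    if not labels:
        raise ValueError("no trials")
    return sum(label != "ignored" for label in labels) / len(labels)


# Toy stand-in for a model (assumption): it never reacts to the inserted step.
def toy_model(prefix: List[str]) -> Continuation:
    return Continuation(steps=["So the total is 12."], conclusion="12")


draft = ["There are 3 boxes.", "Each box holds 4 items.", "3 * 4 = 12."]
prefix = intervene(draft, 2, "Each box actually holds 5 items.")
label = classify("12", toy_model(prefix), lambda s: "mistake" in s.lower())
print(label)  # prints "ignored"
```

In a real evaluation, `toy_model` would be replaced by a call that resumes generation from the modified draft prefix, and `flags_error` by a stronger detector than keyword matching. A high share of "ignored" labels is what selective faithfulness would look like under this proxy.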
Draft-to-Answer Faithfulness (two components):
- Draft Reliance: Does the answer-generation stage introduce substantial new reasoning beyond the thinking draft? If so, the draft is not the full reasoning record.
- Draft-Answer Consistency: Does the final answer logically align with the thinking draft's explicit conclusion? Finding: final answers frequently contradict the explicit draft conclusions. The draft may say "therefore X" while the answer states Y.
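Both draft-to-answer components can be approximated with simple rates. The sketch below is an illustration under assumptions, not the paper's metric: `normalize`, `consistency_rate`, and `novel_step_share` are hypothetical helpers, matching is exact string comparison after normalization, and the draft conclusion and answer steps are assumed to be already extracted.

```python
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(text.lower().split()).rstrip(".")


def consistency_rate(pairs: List[Tuple[str, str]]) -> float:
    """Share of (draft_conclusion, final_answer) pairs that agree."""
    if not pairs:
        raise ValueError("no samples")
    return sum(normalize(d) == normalize(a) for d, a in pairs) / len(pairs)


def novel_step_share(draft_steps: List[str], answer_steps: List[str]) -> float:
    """Share of answer-stage steps that never appear in the draft."""
    if not answer_steps:
        return 0.0  # an answer with no reasoning steps adds nothing new
    seen = {normalize(step) for step in draft_steps}
    return sum(normalize(s) not in seen for s in answer_steps) / len(answer_steps)


# Hypothetical samples: the second pair is a "therefore X" draft with answer Y.
pairs = [("X", "x."), ("X", "Y"), ("42", "42")]
print(consistency_rate(pairs))
print(novel_step_share(["a = 2", "b = 3"], ["a = 2", "a + b = 5"]))  # prints 0.5
```

A low `consistency_rate` corresponds to the draft-answer consistency failure, and a high `novel_step_share` suggests the answer stage is doing reasoning the draft does not record. Exact matching will miss paraphrases, so a real measurement would need a semantic comparison.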
Both failures undermine the monitoring promise from different directions. Intra-draft inconsistency means you can't trace error propagation through the draft. Draft-answer inconsistency means even a coherent, correct-looking draft doesn't guarantee a correct answer derived from it.
The safety implications are immediate: inserting corrective content into thinking drafts won't reliably fix outputs (intra-draft faithfulness fails). Reading draft conclusions to predict final answers won't reliably work (draft-answer consistency fails). The draft is an unreliable proxy for the computation it represents.
This extends "Do language models actually use their reasoning steps?" with a two-dimensional operationalization and an empirical methodology. Both questions, "does the draft causally influence the answer" (causal sufficiency) and "does the answer depend on the draft" (necessity), can now be measured via counterfactual intervention.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries.
- What makes draft-centric systems better anchors for coherence than feed-forward outputs?
- Why do final answers contradict what the thinking draft explicitly concluded?
- Can inserted errors in reasoning drafts produce predictable downstream effects?
- Does the answer stage perform substantial reasoning beyond the thinking draft?
- How does program-aided reasoning externalize intermediate computation into executable form?
- Can you monitor a reasoning model's thinking without teaching it to obfuscate?
Related concepts in this collection (4)
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  Relation: operationalizes with two specific measurable dimensions; counterfactual intervention is the methodology that makes the abstract claim testable
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  Relation: draft-to-answer consistency failure is the empirical confirmation of why trace anthropomorphism is dangerous
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  Relation: behavioral correlation: confirmatory reflection is the content-level evidence of faithfulness failure. If reflection tokens confirm rather than evaluate, they are causal decoration, not causal drivers
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Relation: provides the theoretical grounding: draft unfaithfulness is the expected outcome if CoT is imitation of reasoning form rather than genuine inference. Drafts are performative by construction, so draft-answer disconnects are structural, not accidental
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
- Measuring Faithfulness in Chain-of-Thought Reasoning
- DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- LLM Reasoning Is Latent, Not the Chain of Thought
Original note title
thinking draft faithfulness has two separable dimensions — intra-draft causal consistency and draft-to-answer consistency — current LRMs fail both