Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

The promise of thinking models for AI safety monitoring is specific: because the model externalizes its reasoning in a thinking draft before answering, observers can read the draft to detect errors and control what happens in the answer stage. This promise depends on one empirical assumption: that the thinking draft faithfully represents the model's actual internal computation. The source paper tests that assumption with counterfactual interventions and finds it frequently violated.

Intra-Draft Faithfulness: When a false or contradictory step is inserted mid-draft, do subsequent steps and the final draft conclusion appropriately integrate or correct it? If the draft is faithful, inserted errors should produce systematic downstream effects. Finding: large reasoning models (LRMs) show selective faithfulness: some steps influence the draft's conclusion, most do not. Counterfactual integration is inconsistent across models and tasks.
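
The intra-draft test can be sketched as a small harness. This is a minimal illustration, not the source paper's implementation: `ContinueFn`, `insert_step`, `intervention_changes_conclusion`, and both toy models are hypothetical names introduced here, and the check only asks whether the draft conclusion changes after an injected step (it does not detect an explicit correction).

```python
from typing import Callable, List

# Hypothetical interface (an assumption): given the draft steps so far,
# return the conclusion the draft eventually reaches.
ContinueFn = Callable[[List[str]], str]


def insert_step(steps: List[str], index: int, injected: str) -> List[str]:
    """Return the draft prefix up to `index` followed by the injected step."""
    return steps[:index] + [injected]


def intervention_changes_conclusion(
    continue_fn: ContinueFn, steps: List[str], index: int, injected: str
) -> bool:
    """True if injecting a step after the prefix changes the draft conclusion."""
    baseline = continue_fn(steps[:index])
    intervened = continue_fn(insert_step(steps, index, injected))
    return baseline != intervened


def toy_follow(prefix: List[str]) -> str:
    """Toy stand-in that actually uses its steps: sums the latest 'name = value' lines."""
    values = {}
    for step in prefix:
        name, _, value = step.partition("=")
        values[name.strip()] = int(value)
    return str(sum(values.values()))


def toy_ignore(prefix: List[str]) -> str:
    """Toy stand-in that ignores the draft and returns a fixed conclusion."""
    return "7"


draft = ["x = 3", "y = 4"]
followed = intervention_changes_conclusion(toy_follow, draft, 2, "y = 5")
ignored = intervention_changes_conclusion(toy_ignore, draft, 2, "y = 5")
print(followed, ignored)
```

Averaging this boolean over many drafts and injection points would give a rough per-model influence rate; with a real model, `continue_fn` would wrap a generation call that resumes from the edited draft.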

Draft-to-Answer Faithfulness (two components):

Draft Reliance: Does the answer-generation stage introduce substantial new reasoning beyond the thinking draft? If so, the draft is not the full reasoning record.
Draft-Answer Consistency: Does the final answer logically align with the thinking draft's explicit conclusion? Finding: final answers frequently contradict the explicit draft conclusions. The draft may say "therefore X" while the answer states Y.
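
Both draft-to-answer components can be approximated from logged outputs. The sketch below is an assumption-laden illustration: the `Record` fields, the exact-match comparison, and the lexical `novel_word_share` proxy for Draft Reliance are all hypothetical choices made here, not the source paper's metrics, and extracting a conclusion from a draft is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    draft_text: str        # full thinking draft
    draft_conclusion: str  # conclusion extracted from the draft (extraction assumed)
    answer_text: str       # everything produced in the answer stage
    final_answer: str      # final answer extracted from the answer stage


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def consistency_rate(records: List[Record]) -> float:
    """Share of records whose final answer matches the draft's explicit conclusion."""
    if not records:
        raise ValueError("need at least one record")
    matches = sum(
        normalize(r.draft_conclusion) == normalize(r.final_answer) for r in records
    )
    return matches / len(records)


def novel_word_share(record: Record) -> float:
    """Crude reliance proxy: share of answer-stage words that never appear in the draft."""
    draft_words = set(normalize(record.draft_text).split())
    answer_words = normalize(record.answer_text).split()
    if not answer_words:
        return 0.0
    novel = [w for w in answer_words if w not in draft_words]
    return len(novel) / len(answer_words)


records = [
    Record("2 plus 2 is 4 so x = 4", "X = 4", "x = 4", "x = 4"),
    Record("so the result is 7", "7", "wait, recompute: the result is 9", "9"),
]
print(consistency_rate(records))
print([novel_word_share(r) for r in records])
```

A high `novel_word_share` only flags that the answer stage contains material absent from the draft; deciding whether that material is new reasoning would need a stronger judge than word overlap.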

Both failures undermine the monitoring promise from different directions. Intra-draft inconsistency means you can't trace error propagation through the draft. Draft-answer inconsistency means even a coherent, correct-looking draft doesn't guarantee a correct answer derived from it.

The safety implications are immediate: inserting corrective content into thinking drafts won't reliably fix outputs (intra-draft faithfulness fails). Reading draft conclusions to predict final answers won't reliably work (draft-answer consistency fails). The draft is an unreliable proxy for the computation it represents.

This extends the note "Do language models actually use their reasoning steps?" with a two-dimensional operationalization and an empirical methodology. Both dimensions, "does the draft causally influence the answer" (causal sufficiency) and "does the answer depend on the draft" (necessity), can now be measured via counterfactual intervention.
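
One way to read the two questions above as interventions is sketched below. The mapping is an interpretation made here, not a method confirmed by the source: editing the draft's conclusion probes whether the answer follows the draft, and removing the draft probes whether the answer depends on it. `AnswerFn` and the toy models are hypothetical stand-ins for a real answer-generation call.

```python
from typing import Callable

# Hypothetical interface (an assumption): (question, draft) -> final answer.
AnswerFn = Callable[[str, str], str]


def follows_edited_draft(
    answer_fn: AnswerFn, question: str, edited_draft: str, edited_conclusion: str
) -> bool:
    """True if the answer adopts the conclusion of a counterfactually edited draft."""
    return answer_fn(question, edited_draft) == edited_conclusion


def changes_without_draft(answer_fn: AnswerFn, question: str, draft: str) -> bool:
    """True if removing the draft entirely changes the final answer."""
    return answer_fn(question, "") != answer_fn(question, draft)


def toy_reader(question: str, draft: str) -> str:
    """Toy stand-in that answers with the last token of the draft, if any."""
    return draft.split()[-1] if draft else "unknown"


def toy_memorizer(question: str, draft: str) -> str:
    """Toy stand-in that ignores the draft and always answers '7'."""
    return "7"


question = "What is 3 + 4?"
for model in (toy_reader, toy_memorizer):
    print(
        model.__name__,
        follows_edited_draft(model, question, "3 + 4 is therefore 9", "9"),
        changes_without_draft(model, question, "3 + 4 is therefore 7"),
    )
```

A model for which both checks come back false is answering independently of its draft, which is the case where reading the draft tells a monitor the least.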

Related concepts in this collection

Do language models actually use their reasoning steps?
Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
operationalizes with two specific measurable dimensions; counterfactual intervention is the methodology that makes the abstract claim testable

Do reasoning traces actually cause correct answers?
Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
draft-to-answer consistency failure is the empirical confirmation of why trace anthropomorphism is dangerous

Does reflection in reasoning models actually correct errors?
When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
behavioral correlation: confirmatory reflection is the content-level evidence of faithfulness failure; if reflection tokens confirm rather than evaluate, they are causal decoration, not causal drivers

Does chain-of-thought reasoning reveal genuine inference or pattern matching?
Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
provides the theoretical grounding: draft unfaithfulness is the expected outcome if CoT is imitation of reasoning form rather than genuine inference; drafts are performative by construction, so draft-answer disconnects are structural, not accidental
