SYNTHESIS NOTE

Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

The promise of thinking models for AI safety monitoring is specific: because the model externalizes its reasoning in a thinking draft before answering, observers can read the draft to detect errors and control what happens in the answer stage. This promise depends on one empirical assumption: that the thinking draft faithfully represents the model's actual internal computation. This paper tests that assumption with counterfactual interventions and finds it frequently violated.

Intra-Draft Faithfulness: When a false or contradictory step is inserted mid-draft, do the subsequent steps and the final draft conclusion appropriately integrate or correct it? If the draft is faithful, inserted errors should produce systematic downstream effects. Finding: large reasoning models (LRMs) show selective faithfulness: some steps matter, most do not. Counterfactual integration is inconsistent across models and tasks.
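The intra-draft intervention can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the paper's implementation: the `ContinueDraft` interface, the function names, and the two toy continuation functions are all hypothetical stand-ins for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical interface (not from the source): given the draft steps so far,
# return the remaining steps. The last returned step is the draft conclusion.
ContinueDraft = Callable[[List[str]], List[str]]


def conclusion_changed(
    steps: List[str],
    position: int,
    counterfactual: str,
    continue_draft: ContinueDraft,
) -> bool:
    """Return True if inserting a counterfactual step changes the draft conclusion."""
    prefix = steps[:position]
    baseline = continue_draft(prefix)[-1]
    perturbed = continue_draft(prefix + [counterfactual])[-1]
    return baseline != perturbed


def integrating_toy(prefix: List[str]) -> List[str]:
    """Toy stand-in that reads every step; a later assignment overrides an earlier one."""
    values: Dict[str, int] = {}
    for step in prefix:
        name, value = step.split("=")
        values[name.strip()] = int(value)
    return [f"sum = {sum(values.values())}"]


def ignoring_toy(prefix: List[str]) -> List[str]:
    """Toy stand-in that ignores the draft and emits a fixed conclusion."""
    return ["sum = 5"]


steps = ["a = 2", "b = 3"]
integrated = conclusion_changed(steps, 2, "b = 30", integrating_toy)
ignored = conclusion_changed(steps, 2, "b = 30", ignoring_toy)
```

With these toys, `integrated` is `True` and `ignored` is `False`. A real study would repeat the check across many drafts and insertion points; the sketch only shows the shape of a single comparison.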

Draft-to-Answer Faithfulness (two components):

Both failures undermine the monitoring promise from different directions. Intra-draft inconsistency means you can't trace error propagation through the draft. Draft-answer inconsistency means even a coherent, correct-looking draft doesn't guarantee a correct answer derived from it.

The safety implications are immediate. Inserting corrective content into a thinking draft will not reliably fix the output, because intra-draft faithfulness fails. Reading the draft conclusion to predict the final answer will not reliably work, because draft-to-answer consistency fails. The draft is an unreliable proxy for the computation it represents.
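One simple way to quantify how well draft conclusions predict final answers is an agreement rate over paired samples. The sketch below is an assumption for illustration only: the function names are hypothetical, the sample pairs are made up, and exact match after normalization is a deliberate simplification that may differ from the metric used in the paper.

```python
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def draft_answer_consistency(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (draft conclusion, final answer) pairs that agree after normalization."""
    items = list(pairs)
    if not items:
        raise ValueError("need at least one (draft conclusion, final answer) pair")
    matches = sum(normalize(draft) == normalize(answer) for draft, answer in items)
    return matches / len(items)


# Made-up example pairs: (draft conclusion, final answer).
samples = [
    ("42", "42"),
    ("Paris", "paris "),
    ("7", "9"),
    ("yes", "no"),
]
rate = draft_answer_consistency(samples)
```

Here two of the four pairs agree, so `rate` is 0.5. A rate well below 1.0 on real data would mean a monitor that reads only the draft conclusion mispredicts the final answer in a corresponding share of cases.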

This extends the note "Do language models actually use their reasoning steps?" with a two-dimensional operationalization and an empirical methodology. Both dimensions, "does the draft causally influence the answer" (causal sufficiency) and "does the answer depend on the draft" (necessity), can now be measured via counterfactual intervention.

Original note title

thinking draft faithfulness has two separable dimensions — intra-draft causal consistency and draft-to-answer consistency — current LRMs fail both