SYNTHESIS NOTE

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

When training on reasoning demonstrations, what actually gets learned? Controlled ablations reveal a striking asymmetry: models are highly tolerant to content errors but highly sensitive to structural disruption.

Two types of perturbation were applied to Long CoT training samples:

Content perturbations (model is mostly unaffected):

Structural perturbations (model is severely affected):
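The two perturbation families can be illustrated with a minimal sketch. The specific perturbations are not enumerated in this note, so the two below (random digit replacement for content, step shuffling for structure) are assumptions chosen only to show the contrast, and the function names and the toy trace are hypothetical.

```python
import random
import re

def perturb_content(steps, rng, p=0.5):
    """Content perturbation (assumed example): overwrite digits at random.

    Step order and wording are untouched; only the numbers become unreliable.
    """
    def corrupt(match):
        # Replace the digit with probability p, otherwise keep it as-is.
        return str(rng.randrange(10)) if rng.random() < p else match.group(0)
    return [re.sub(r"\d", corrupt, step) for step in steps]

def perturb_structure(steps, rng):
    """Structural perturbation (assumed example): shuffle the step order.

    Each step's text is untouched; only the logical sequence is broken.
    """
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical toy trace, not taken from any dataset.
trace = [
    "Let x = 12 and y = 7.",
    "Then x + y = 19.",
    "Wait, check: 19 - 7 = 12, so the sum is consistent.",
    "Therefore the answer is 19.",
]

rng = random.Random(0)
content_corrupted = perturb_content(trace, rng)
structure_corrupted = perturb_structure(trace, rng)
```

In the content-corrupted trace the skeleton of every step survives (declare, derive, check, conclude) even though the numbers may be wrong. In the structure-corrupted trace every sentence is individually correct, but the order no longer encodes which step depends on which.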

What models learn from reasoning demonstrations is not what to think but how to structure thinking: the pattern of reflection, backtracking, and self-validation that makes long CoT effective. The specific facts, numbers, and even the correctness of individual steps are secondary. The logical architecture — which steps precede which, how contradiction leads to backtracking, how intermediate validation is structured — is primary.

This partially explains why distillation from a larger reasoning model to a smaller one works even with relatively few samples (17k samples showed substantial gains): the small model is not memorizing the reasoning content; it is acquiring the structural pattern of how reasoning unfolds. Structure is cheap to transmit.

This deepens the note "Does training data format shape reasoning strategy more than domain?", which found that training format (multiple choice vs. fill-in) shapes strategy more than domain does. The present finding shows the same principle operating at a finer scale: within a Long CoT format, structural coherence matters more than content correctness. Format dominance operates at multiple levels.

The practical implication: generating training data for reasoning models does not require perfect reasoning. It requires structurally coherent reasoning — chains with correct logical architecture, even if specific steps contain errors.

FOL-based validation confirms the coherence/validity distinction: analysis of RLVR-trained models using a first-order logic (FOL) error taxonomy shows that RLVR improves local trace coherence (transitions between adjacent steps become more logically consistent) without guaranteeing global mathematical validity. The models produce traces that read as better reasoning (fewer non-sequiturs, more explicit intermediate steps), but the improvement is structural, not semantic. Local consistency gains should not be mistaken for improved mathematical proof capability. This provides formal grounding for the structural-over-content principle: what RLVR optimizes is the architecture of reasoning, not its truth-preserving properties. See the note "Does RLVR actually improve mathematical reasoning or just coherence?"
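The gap between local coherence and global validity can be made concrete with a toy arithmetic trace. This is only an illustrative sketch: it does not implement the FOL error taxonomy, and the `Step` type, the two checker functions, and the example problem are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Step:
    start: int   # value the step claims to start from
    op: str      # "+" or "*"
    arg: int     # operand applied in this step
    result: int  # value the step claims to produce

def apply_op(op, a, b):
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unsupported op: {op}")

def locally_coherent(trace):
    """True if each step's arithmetic is right and each step picks up
    exactly where the previous one left off."""
    steps_ok = all(apply_op(s.op, s.start, s.arg) == s.result for s in trace)
    links_ok = all(cur.start == prev.result for prev, cur in zip(trace, trace[1:]))
    return steps_ok and links_ok

def globally_valid(trace, true_answer):
    """True if the final claimed result matches the ground-truth answer."""
    return trace[-1].result == true_answer

# Hypothetical problem: start at 4, add 3, then double. Ground truth: 14.
# The trace misreads the starting value as 5 but reasons cleanly from there.
trace = [Step(5, "+", 3, 8), Step(8, "*", 2, 16)]

coherent = locally_coherent(trace)     # every adjacent transition checks out
valid = globally_valid(trace, 14)      # the conclusion is still wrong
```

Here every transition passes the local check while the conclusion fails the global one, which is the pattern the paragraph above warns against reading as improved proof capability.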

Molecular bond taxonomy specifies what kind of structure matters: the Molecular Structure of Thought paper decomposes Long CoT structure into three interaction types: Deep-Reasoning (covalent bonds: dense local deduction clusters), Self-Reflection (hydrogen bonds: long-range corrective links), and Self-Exploration (van der Waals forces: weak bridges between distant clusters). This provides the specific structural vocabulary for why coherence matters: effective reasoning requires the right distribution of these bond types, and "semantic isomers" (same semantic content, different bond distributions) from different teachers destabilize learning when mixed, even with matched token statistics. See the note "Does long chain of thought reasoning follow molecular bond patterns?"
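As a rough sketch of what "different bond distributions" could mean operationally, assume each step of a trace has already been labeled with one of the three interaction types. The labeling itself, the label strings, and the use of total variation distance as the comparison are assumptions made for illustration, not the paper's method.

```python
from collections import Counter

# Hypothetical label names for the three interaction types.
BOND_TYPES = ("deep_reasoning", "self_reflection", "self_exploration")

def bond_distribution(labels):
    """Fraction of steps carrying each bond type."""
    counts = Counter(labels)
    total = sum(counts[b] for b in BOND_TYPES)
    if total == 0:
        raise ValueError("no recognized bond labels in trace")
    return {b: counts[b] / total for b in BOND_TYPES}

def total_variation(p, q):
    """Total variation distance between two bond distributions (0 = identical)."""
    return 0.5 * sum(abs(p[b] - q[b]) for b in BOND_TYPES)

# Hypothetical step labels for two teachers solving the same problem.
teacher_a = ["deep_reasoning"] * 6 + ["self_reflection"] * 3 + ["self_exploration"] * 1
teacher_b = ["deep_reasoning"] * 3 + ["self_reflection"] * 1 + ["self_exploration"] * 6

dist_a = bond_distribution(teacher_a)
dist_b = bond_distribution(teacher_b)
gap = total_variation(dist_a, dist_b)
```

Both toy traces have ten steps, so a length-based statistic would not separate them, while the bond distributions differ substantially. A check of this kind could flag teacher mixtures whose traces match in surface statistics but not in structure.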


Original note title

long cot learning is driven by structural coherence, not content correctness