SYNTHESIS NOTE

Can reconstructing expert thinking improve reasoning transfer?

Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?

Synthesis note · 2026-05-03 · sourced from Data

Standard reasoning training uses supervised fine-tuning or reinforcement learning, both of which require task-specific signals (math correctness, code execution) and therefore cannot scale to domains where verifiable feedback is unavailable. Continual pretraining (CPT) avoids this constraint but provides no reasoning signal: the model simply sees more text. Reasoning CPT proposes a third path. Every expert text (a math proof, a legal opinion) is the visible result of an underlying thought process involving trial, hypothesis, recall, and verification, and that hidden thought process can be reconstructed as synthetic data. This is the same surface-versus-process distinction that drives the related note "Why do language models need so much more text than humans?".

The reconstruction targets four characteristic aspects of expert thinking: human-like spontaneous expressions ("Hmm... ", "Aha!"), background knowledge recall (internally retrieving relevant rules), decision-making (considering an action), and self-verification (checking for omissions). The synthetic training sequence concatenates the original text with its reconstructed hidden thoughts, giving the model both the visible result and the implicit process behind it.
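The concatenation step described above can be sketched as follows. This is a minimal illustration, not the method's actual data format: the delimiter strings, the `ExpertSample` type, the `build_training_sequence` helper, and the choice to place the hidden thought before the text are all assumptions, since the note only says the two parts are concatenated.

```python
from dataclasses import dataclass

# Hypothetical delimiters: the note does not say how hidden thoughts are marked.
THOUGHT_OPEN = "<hidden_thought>"
THOUGHT_CLOSE = "</hidden_thought>"


@dataclass
class ExpertSample:
    text: str            # original expert text (the visible result)
    hidden_thought: str  # reconstructed thought process (synthetic)


def build_training_sequence(sample: ExpertSample, thought_first: bool = True) -> str:
    """Concatenate an expert text with its reconstructed hidden thought.

    The ordering is an assumption exposed as a flag, because the note
    does not state which part comes first.
    """
    thought = f"{THOUGHT_OPEN}\n{sample.hidden_thought.strip()}\n{THOUGHT_CLOSE}"
    text = sample.text.strip()
    parts = [thought, text] if thought_first else [text, thought]
    return "\n".join(parts)


if __name__ == "__main__":
    sample = ExpertSample(
        text="A contract requires offer, acceptance, and consideration.",
        hidden_thought="Hmm... which elements are required? Let me check I have not omitted one.",
    )
    print(build_training_sequence(sample))
```

Keeping the ordering as a parameter makes it easy to compare both layouts if the actual format turns out to differ from the one assumed here.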

Three findings distinguish this from standard CPT. First, cross-domain transfer: training on hidden thoughts reconstructed from law texts improves not just MMLU social sciences but also MMLU-STEM by 4.3 points, because what transfers is the reasoning skill rather than the domain knowledge. Second, the gap widens with difficulty: on the hardest MMLU problems, Reasoning CPT reaches 51.8-52.5% accuracy versus 43.9-44.6% for CPT, a roughly 8-point advantage. Third, models automatically adjust reasoning length to problem difficulty (short for easy, long for hard) without explicit instruction.
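The "roughly 8-point" figure can be checked directly from the quoted ranges. The short sketch below pairs the lower bounds with each other and the upper bounds with each other; that pairing is an assumption, since the note does not say which conditions the range endpoints correspond to. The helper name `gap_range` is illustrative.

```python
# Accuracy ranges (percent) quoted in the note for the hardest MMLU problems.
reasoning_cpt = (51.8, 52.5)
standard_cpt = (43.9, 44.6)


def gap_range(a, b):
    """Return endpoint-wise differences (assumed pairing), rounded to 0.1."""
    return tuple(round(x - y, 1) for x, y in zip(a, b))


gaps = gap_range(reasoning_cpt, standard_cpt)
print(gaps)  # (7.9, 7.9)
```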

A plausible mechanism for the adaptive reasoning length: the training corpus shows a positive correlation between original-text length and hidden-thought length (Spearman ρ = 0.348 for STEM, 0.486 for Law). The model may learn a heuristic (continue thinking until enough evidence accumulates to confidently predict the next token) that produces short chains for easy questions and long chains for hard ones. The implication is that overthinking and underthinking are both consequences of training on text that does not reveal its own thinking-effort calibration.
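For readers who want to reproduce a length correlation like the one quoted above on their own corpus, here is a minimal pure-Python sketch of Spearman's ρ (the Pearson correlation of ranks, with ties given their average rank). The function names and the toy length lists are hypothetical and are not taken from the note's data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical token counts, for illustration only.
    text_lengths = [120, 300, 80, 500, 220]
    thought_lengths = [200, 380, 450, 900, 260]
    print(spearman_rho(text_lengths, thought_lengths))
```

A rank-based measure is a natural fit here because it only asks whether longer texts tend to come with longer hidden thoughts, without assuming the relationship is linear.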



Original note title

expert texts are surface residues of hidden thought processes — and reconstructing those processes for pretraining produces cross-domain reasoning transfer impossible in standard CPT