Can reconstructing expert thinking improve reasoning transfer?

Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?

Synthesis note · 2026-05-03 · sourced from Data

Standard reasoning training relies on supervised fine-tuning or reinforcement learning. Both require task-specific signals (math correctness, code execution), so neither scales to domains where verifiable feedback is unavailable. Continual pretraining (CPT) avoids this constraint but provides no reasoning signal: the model just sees more text. Reasoning CPT proposes a third path. Every expert text (a math proof, a legal opinion) is the visible result of an underlying thought process involving trial, hypothesis, recall, and verification, and that hidden process can be reconstructed as synthetic data. This is the same surface-versus-process distinction that drives the companion note "Why do language models need so much more text than humans?".

The reconstruction targets four characteristic aspects of expert thinking: human-like spontaneous expressions ("Hmm... ", "Aha!"), background knowledge recall (internally retrieving relevant rules), decision-making (considering an action), and self-verification (checking for omissions). The synthetic training sequence concatenates the original text with its reconstructed hidden thoughts, giving the model both the visible result and the implicit process behind it.
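The concatenation step can be sketched as follows. This is a minimal illustration, not the format used by Reasoning CPT: the delimiter strings, the function name `build_sequence`, and the default ordering (thought before text) are all assumptions, since the note only says the two parts are concatenated.

```python
# Hypothetical delimiters; the note does not say how hidden thoughts are marked.
THOUGHT_OPEN = "<hidden_thought>"
THOUGHT_CLOSE = "</hidden_thought>"


def build_sequence(original_text: str, hidden_thought: str,
                   thought_first: bool = True) -> str:
    """Join an expert text with its reconstructed hidden thought.

    The ordering is an assumption exposed as a flag: the note states only
    that the two parts are concatenated into one training sequence.
    """
    thought_block = f"{THOUGHT_OPEN}\n{hidden_thought.strip()}\n{THOUGHT_CLOSE}"
    text_block = original_text.strip()
    if thought_first:
        parts = [thought_block, text_block]
    else:
        parts = [text_block, thought_block]
    return "\n".join(parts)


# Toy example touching the four aspects: spontaneous expression,
# knowledge recall, decision-making, and self-verification.
example = build_sequence(
    original_text="A contract requires offer and acceptance.",
    hidden_thought=(
        "Hmm... which rule governs formation? "
        "Recall: offer, acceptance, consideration. "
        "I will state the core requirement first. "
        "Check: did I omit anything essential?"
    ),
)
```

Keeping the thought inside explicit delimiters is one possible design choice; it would let a later stage strip or mask the thought span if needed.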

Three findings distinguish this from standard CPT. First, cross-domain transfer: training on hidden thoughts reconstructed from legal texts improves not just MMLU social sciences but also MMLU-STEM by 4.3 points, because the reasoning skill, not the domain knowledge, transfers. Second, the gap widens with difficulty: on the hardest MMLU problems, Reasoning CPT reaches 51.8-52.5% accuracy versus 43.9-44.6% for CPT, a roughly 8-point advantage. Third, models automatically adjust reasoning length to problem difficulty (short for easy, long for hard) without explicit instruction.
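The "roughly 8-point" figure can be checked from the quoted ranges. The sketch below assumes the two endpoints of each range can be compared either like-for-like or at their extremes; the note does not say which configurations produced which endpoint, so the variable names and pairings are illustrative only.

```python
# Accuracy ranges (%) on the hardest MMLU problems, as quoted in the note.
reasoning_cpt = (51.8, 52.5)
cpt = (43.9, 44.6)

# Assumption: pair like endpoints (low with low, high with high).
paired_gaps = [round(r - c, 1) for r, c in zip(reasoning_cpt, cpt)]

# Bounds on the gap if the endpoints are not paired that way.
smallest_gap = round(min(reasoning_cpt) - max(cpt), 1)
largest_gap = round(max(reasoning_cpt) - min(cpt), 1)

print(paired_gaps, smallest_gap, largest_gap)
```

Under either reading the advantage stays between about 7 and 9 points, which is consistent with the rounded figure in the text.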

A plausible mechanism for the adaptive reasoning length: the training corpus shows a positive correlation between original-text length and hidden-thought length (Spearman ρ = 0.348 for STEM, 0.486 for law). The model plausibly learns a heuristic (continue thinking until enough evidence accumulates to confidently predict the next token), which produces short chains for easy questions and long chains for hard ones. The implication is that overthinking and underthinking are both consequences of training on text that does not reveal its own thinking-effort calibration.
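For readers who want to reproduce this kind of length correlation on their own corpus, here is a minimal pure-Python sketch of Spearman's ρ (Pearson correlation computed on average ranks). The function names and the toy length lists are hypothetical and are not the data behind the 0.348 and 0.486 figures.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical token counts for five (original text, hidden thought) pairs.
text_lengths = [120, 340, 90, 560, 210]
thought_lengths = [80, 150, 95, 400, 130]
rho = spearman_rho(text_lengths, thought_lengths)
```

A rank-based measure is a natural fit here because it only asks whether longer texts tend to come with longer thoughts, without assuming the relationship is linear.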

Related concepts in this collection 5

Why do language models need so much more text than humans? Language models train on the surface of written text, but humans learn by inferring the underlying thoughts behind what they read. Does this explain why models need vastly more data to reach human-level understanding?
extends: companion piece — same compressed-surface diagnosis applied at the pretraining-data level instead of the inference level
Can chain-of-thought reasoning be learned during pretraining itself? Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
complements: RPT and Reasoning CPT both train reasoning at pretraining time but with different signals — information-gain reward vs reconstructed hidden thoughts
Can next-token prediction become a reasoning task with RL? Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
complements: RPT generalizes reasoning to any domain via RL on next-token; this note generalizes via reconstructed thoughts; both attack domain-specificity of reasoning training
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
complements: hidden-thought reconstruction as a way of activating latent capability without RLVR's verifiability requirement
Does AI text generation unfold through temporal reflection? Explores whether the sequential ordering of tokens in LLM generation constitutes genuine temporal thought or merely probabilistic computation without reflective duration.
tension: reconstructed thoughts add a quasi-temporal trace ("Hmm... Aha!") to training data, but surface markers of temporal cognition do not actually install temporality
