Does procedural knowledge drive reasoning more than factual retrieval?

Explores whether models learn reasoning through general procedures spread across diverse documents rather than by memorizing specific facts. This matters for understanding how pretraining data teaches models to reason.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

The "Procedural Knowledge in Pretraining Drives Reasoning" paper asks which pretraining documents most influence LLM reasoning, ranking 5 million documents by their influence on model completions. The finding: models do not approach reasoning the way they approach retrieval. For reasoning tasks, the positively influential documents contain procedural knowledge (descriptions of how to reach a solution) rather than the specific facts needed for the answer.
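To make "ranking documents by their influence" concrete, here is a minimal sketch on a toy logistic model. This note does not say which influence estimator the paper uses, so the sketch swaps in a simple first-order stand-in: the dot product between a training example's loss gradient and the query's loss gradient. A positive score means a gradient step on that example would, to first order, lower the loss on the query. The model, the data, and the names `loss_grad` and `rank_by_influence` are all hypothetical.

```python
import math

def loss_grad(w, x, y):
    """Gradient of the logistic loss on one example (x, y) with respect to w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rank_by_influence(w, train, query):
    """Score each training example by gradient alignment with the query.

    Assumption: a first-order gradient dot product, used here as a
    simplified stand-in for a full influence estimate.
    """
    g_query = loss_grad(w, *query)
    scores = [dot(loss_grad(w, x, y), g_query) for x, y in train]
    # Most positively influential examples first.
    order = sorted(range(len(train)), key=lambda i: -scores[i])
    return order, scores

# Hypothetical toy data: three "documents" and one query.
w = [0.0, 0.0]
train = [([1.0, 0.0], 1), ([0.0, 1.0], 1), ([-1.0, 0.0], 1)]
query = ([1.0, 0.0], 1)

order, scores = rank_by_influence(w, train, query)
```

In this toy setup the example that matches the query ranks first, the unrelated example scores zero, and the opposing example scores negative. At the scale of millions of documents and an LLM, per-document gradients of this kind are expensive to compute, which is why approximations are needed in practice.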

Three contrasts with factual recall:

Generality: models rely on a broader, more general set of documents when reasoning than when answering factual questions. Factual recall draws on a narrow set of documents containing the target fact. Reasoning draws on a diffuse set of documents performing similar procedures.
Transferability: a document has similar influence across reasoning queries that apply the same procedure to different numbers. Procedural knowledge transfers across specific instances: the model has learned the method, not the content.
Reliance distribution: the model needs to see factual information more often (across more documents) to memorize it, while procedural patterns can be learned from fewer but more diverse demonstrations.
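The first two contrasts above can be expressed as simple statistics over per-document influence scores. The sketch below is illustrative only: the score vectors are invented, and `pearson` and `top_k_share` are hypothetical helper names, not taken from the paper. Transferability shows up as a high correlation between the score vectors of two queries that share a procedure. Generality shows up as a low share of total influence held by the top-ranked documents.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length score vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def top_k_share(scores, k):
    """Fraction of total absolute influence held by the k largest documents."""
    mags = sorted((abs(s) for s in scores), reverse=True)
    return sum(mags[:k]) / sum(mags)

# Invented influence scores over the same five documents (assumption).
reasoning_q1 = [0.30, 0.25, 0.20, 0.15, 0.10]  # same procedure, numbers A
reasoning_q2 = [0.28, 0.26, 0.21, 0.14, 0.11]  # same procedure, numbers B
factual_q = [0.90, 0.04, 0.03, 0.02, 0.01]     # one document holds the fact

transfer = pearson(reasoning_q1, reasoning_q2)
reasoning_concentration = top_k_share(reasoning_q1, 1)
factual_concentration = top_k_share(factual_q, 1)
```

With these made-up numbers, the two reasoning queries are strongly correlated, and the factual query concentrates far more of its influence in a single document than the reasoning query does. Real influence scores would be noisier, so the comparison would need many queries per condition.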

This connects to the knowledge/reasoning layer separation described in "Why does reasoning training help math but hurt medical tasks?". The procedural knowledge finding provides the data-level explanation for that architectural finding: lower layers store memorized facts (requiring document-specific exposure), while higher layers encode procedural strategies (learnable from general demonstrations).

The implication for training data curation: reasoning capability benefits more from diverse demonstrations of procedures than from exhaustive factual coverage. Quality and diversity of reasoning demonstrations may matter more than volume, which is consistent with "Can models improve themselves on tasks without verifiable answers?".

Related concepts in this collection 5

Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
architectural explanation for the data-level finding: procedural knowledge lives in higher layers

Can models improve themselves on tasks without verifiable answers? Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
consistent: small amounts of diverse procedural demonstration catalyze reasoning

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
procedural knowledge from pretraining IS the latent capability that minimal signals unlock

Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
tension: procedural knowledge may be a form of heuristic rather than genuine reasoning

Can text-trained models compress images better than specialized tools? Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
procedural knowledge compresses better than factual knowledge (one procedure covers many instances), directly explaining why compression = generalization is more powerful for reasoning than for factual recall

Original note title

procedural knowledge in pretraining documents drives reasoning generalization unlike factual retrieval which requires document-specific memorization
