Does procedural knowledge drive reasoning more than factual retrieval?
Explores whether models learn to reason from general procedures spread across diverse documents rather than by memorizing specific facts. This matters for understanding what in pretraining data actually teaches models to reason.
The "Procedural Knowledge in Pretraining Drives Reasoning" paper analyzes which pretraining documents most influence LLM reasoning by ranking 5 million documents by their influence on model completions. The central finding is that models do not approach reasoning the way they approach retrieval. For reasoning tasks, the positively influential documents contain procedural knowledge (descriptions of how to get to a solution) rather than the specific facts needed for the answer.
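To make "ranking documents by their influence" concrete, here is a minimal sketch on a toy linear model. It uses a first-order gradient-similarity score as a stand-in, which is an assumption for illustration and not the estimator used in the paper. The names `loss_gradient` and `rank_by_influence`, and the tiny documents, are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features, target)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def loss_gradient(weights: Vector, features: Vector, target: float) -> Vector:
    """Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w."""
    error = dot(weights, features) - target
    return [error * x for x in features]


def rank_by_influence(
    weights: Vector,
    documents: List[Tuple[str, Vector, float]],
    query: Example,
) -> List[Tuple[str, float]]:
    """Rank documents by gradient similarity to the query (toy stand-in).

    A positive score means a small gradient step on that document would
    lower the query loss; a negative score means it would raise it.
    """
    query_grad = loss_gradient(weights, query[0], query[1])
    scored = [
        (name, dot(loss_gradient(weights, features, target), query_grad))
        for name, features, target in documents
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy "documents": two-feature training examples.
toy_documents = [
    ("doc_a", [1.0, 0.0], 1.0),   # agrees with the query
    ("doc_b", [0.0, 1.0], 1.0),   # unrelated feature
    ("doc_c", [1.0, 0.0], -1.0),  # contradicts the query
]
toy_query = ([1.0, 0.0], 1.0)
ranking = rank_by_influence([0.0, 0.0], toy_documents, toy_query)
```

In this toy setting `doc_a` ranks first and `doc_c` last. The point is only the shape of the analysis: score every document against a query completion, then inspect what the top-ranked documents contain.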
Three contrasts with factual recall:
Generality: models rely on a broader, more general set of documents when reasoning than when answering factual questions. Factual recall draws on a narrow set of documents containing the target fact. Reasoning draws on a diffuse set of documents performing similar procedures.
Transferability: documents have similar influence on reasoning queries that require applying the same procedure to different numbers. The procedural knowledge transfers across specific instances — it's the method, not the content, that the model has learned.
Reliance distribution: the model needs to see factual information more often (across more documents) to memorize it, while procedural patterns can be learned from fewer but more diverse demonstrations.
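The transferability contrast can be read as a statement about correlation: if two queries apply the same procedure to different numbers, their per-document influence scores should line up, whereas two unrelated factual queries should each depend on their own documents. The sketch below shows that comparison. All scores are made-up illustrative values, not results from the paper, and `pearson` is a plain helper written for this example.

```python
from math import sqrt
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equally long score vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL influence scores over the same five documents.
# Two reasoning queries using the same procedure on different numbers:
reasoning_q1 = [0.9, 0.7, 0.1, -0.2, 0.0]
reasoning_q2 = [0.8, 0.6, 0.2, -0.1, 0.1]
# Two factual queries about different facts, each dominated by one document:
factual_q1 = [0.0, 0.1, 2.5, 0.0, -0.1]
factual_q2 = [0.1, 0.0, -0.1, 2.4, 0.0]

reasoning_corr = pearson(reasoning_q1, reasoning_q2)
factual_corr = pearson(factual_q1, factual_q2)
```

With these toy numbers the reasoning pair is strongly correlated and the factual pair is not, which is the pattern the transferability claim describes. Real influence scores would be far noisier.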
This connects to the separation between knowledge and reasoning layers. Following "Why does reasoning training help math but hurt medical tasks?", the procedural knowledge finding provides a data-level explanation for that architectural finding: lower layers store memorized facts (which require document-specific exposure), while higher layers encode procedural strategies (which are learnable from general demonstrations).
The implication for training data curation: reasoning capability benefits more from diverse demonstrations of procedures than from exhaustive factual coverage. Quality and diversity of reasoning demonstrations may matter more than volume for building reasoning capability. This is consistent with "Can models improve themselves on tasks without verifiable answers?", where small amounts of diverse procedural demonstration catalyze reasoning.
Inquiring lines that use this note as a source (154)
This note is a source for these synthesized inquiries.
- How does AI-assisted learning create the Knowledge Custodian paradox in practice?
- How does instrumental reasoning reproduce pre-Enlightenment knowledge structures?
- How does surface salience compete with background knowledge in model inference?
- When does knowledge activation fail across different model architectures?
- What distinguishes planning knowledge from an executable plan that works?
- How does the knowing-doing gap widen as tasks become more complex?
- Why do single examples trigger large reasoning improvements in models?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- What makes a background condition relevant to a specific reasoning task?
- Can retrieval improve multi-step reasoning by triggering at each uncertainty?
- Can models learn when to invoke search during reasoning tasks?
- Why must procedural skills consolidate before strategic reasoning can develop?
- Can prompting inject new knowledge into already-trained AI models?
- How much does organized knowledge improve learning efficiency versus raw data?
- Why does explicit theory injection work better than example-based learning for reasoning tasks?
- What makes reasoning capability a pre-training rather than post-training phenomenon?
- Can causal models be extended to include non-causal cognition?
- How does cognitive fit theory explain why different tasks need different knowledge structures?
- Can prompting unlock compositional skills that pretraining already learned?
- Do explicit reasoning chains improve or harm performance on complex judgment tasks?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- What behavioral markers signal when reasoning chains are performative?
- How much does pre-training frequency predict reasoning task performance?
- How much does prompt format shape what reasoning strategy a model uses?
- How much does training data format shape what reasoning strategy emerges?
- Why does training format shape reasoning strategy more than domain?
- Are larger models and search access substitutes for factual accuracy?
- Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
- What happens to professional expertise when judgment gets encoded into systems?
- Can testing prior knowledge and checking understanding improve explanation outcomes?
- Why does training data format shape reasoning strategy more than domain content?
- Why does general reasoning not transfer to knowledge-intensive medical domains?
- Can verifier-guided search catch factual errors that reasoning training cannot?
- Does constraining AI access during early task phases preserve skill formation?
- Can episodic and semantic memory improve long-horizon task reasoning?
- Can reasoning skills trained on law improve performance in STEM?
- How does cross-domain reasoning transfer differ from domain-specific knowledge transfer?
- Can latent reasoning in continuous space scale beyond supervised reasoning tasks?
- Can extended reasoning training capture individual strategic thinking styles?
- What makes knowledge editing different from simply finding where facts are stored?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Does training data format shape model reasoning more than domain content?
- What makes clinical theory grounding more effective than pattern matching alone?
- Does knowledge structure matter more than knowledge volume for model training?
- What is the difference between procedural knowledge and factual retrieval in reasoning?
- What are retrieval heads and why do they matter for reasoning?
- Does model scaling improve knowledge storage faster than reasoning ability?
- How does dual-rate learning separate episodic and procedural memory in neural networks?
- Why does describing a process differ fundamentally from arguing about evidence?
- How does behavioral fine-tuning differ from factual knowledge encoding in models?
- How does training data distribution create asymmetric competence across relation types?
- How can entailment benchmarks separate genuine reasoning from memorization effects?
- Do models with unfilled memorization capacity appear to generalize falsely?
- Why is extracting training data insufficient proof that models memorize?
- What makes knowledge-rich specialized domains structurally different from general reasoning tasks?
- How do expert priors constrain human researchers from exploring novel concepts?
- How do foundation models develop task-specific heuristics instead of world models?
- Can curriculum degradation of document quality accelerate policy learning?
- Why do vector embeddings fail for sequential procedural retrieval tasks?
- Why do models learn reasoning form instead of actual abstract inference?
- How does training format shape reasoning strategy more than content?
- How much does input format shape what reasoning strategy a model develops?
- What distinguishes conceptual understanding from statistical pattern matching in models?
- How can prompting help models gather information before attempting reasoning?
- How does inductive reasoning from partial evidence enable hypothesis formation?
- Do depth thresholds correspond to transitions between procedural and strategic learning?
- Why does imitation learning create a ceiling for reasoning capability?
- Can targeted activation steering surface latent reasoning in base models?
- How much does training composition affect syntactic versus reasoning performance?
- How do retrieved memories differ from decision-context passages for prediction?
- Why does the same recalled information lead to different reasoning conclusions?
- How much does training data presentation format shape reasoning ability?
- Can models distinguish between activated knowledge and genuine reasoning?
- How does an instruction-following LLM activate latent retrieval knowledge?
- Do reasoning systems reuse cognitive structures across unrelated topics?
- Why do recursive belief models require different training than logical derivation?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- What separates knowledge from reasoning in neural network layers?
- Why do format and structure matter more than actual content in reasoning?
- What role does inductive bias play versus model capacity in practice?
- How do retrieval heads interact with layer-level separation of knowledge and reasoning?
- What role does curriculum design play in reasoning emergence?
- How does a single training example trigger phase transitions in reasoning output?
- Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?
- Why do models follow a two-phase pattern of procedural then strategic learning?
- Does verbal step-by-step reflection preserve learning signals that abstraction removes?
- Why do pretrained model priors reduce the usefulness of retrieved experience?
- What role does a model's representational structure play in learning?
- Why does the generation-verification gap disappear for factual recall tasks?
- Why do SFT models memorize patterns instead of learning generalizable reasoning?
- How does self-referential processing transfer to other reasoning tasks?
- How does training data format shape which reasoning patterns emerge in models?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Does RL teach models when to use reasoning or how to reason?
- What distinguishes memorized tokens from causally necessary reasoning steps?
- Why does training data format shape reasoning strategy more than content?
- Why does reasoning training improve math but hurt knowledge tasks?
- Can the joint-training principle extend beyond memorization and generalization pairs?
- What's the difference between representing world facts and generating world mechanisms?
- Does representational density emerge from training data exposure during pretraining?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- Can simple structure perturbations reliably expose memorization in reasoning models?
- What makes language an effective parameterization for procedural knowledge?
- How do task stream groupings provide long-horizon learning signals for curation decisions?
- What structural differences emerge between early generic skills and later meta-strategy skills?
- Do interaction effects between research mechanisms depend on the task domain?
- What is the distinction between teaching reasoning how versus when to activate?
- Can pretraining signals unlock latent reasoning that post-training merely activates?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- Why does semantic similarity retrieval enable skill transfer to novel situations?
- Why do reasoning tasks improve more than retrieval from lookup memory?
- Does latent reasoning capability exist in base models before any training?
- What distinguishes reasoning activation mechanisms across different training methods?
- Does training data format shape reasoning strategy more than domain content?
- How does backward reasoning during training improve forward reasoning capability?
- What distinguishes data that generalizes broadly from task-specific memorization?
- Do reasoning models fail to report processes that actually influence their answers?
- Does the pretrained prior actually constrain what internalized search can discover?
- How do timing and search internalization interact during reasoning post-training?
- How does continuous implicit memory formation differ from explicit memory encoding?
- Why does reasoning transfer across different numbers but factual recall does not?
- How many document exposures does procedural knowledge versus factual information require?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- Why do higher network layers capture procedural knowledge but lower layers store facts?
- Does RL primarily teach when to use reasoning or how to reason?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- Does token-level reasoning during pretraining improve general reasoning without task-specific supervision?
- Can models possess latent reasoning capability that training signals fail to unlock?
- Why do knowledge and reasoning train in different network layers?
- How can structured reasoning templates serve as rewards for code agent training?
- Does argument-scheme prompting improve reasoning in non-code domains the same way?
- Can models recover knowledge with completely unrelated retraining tasks?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Do text-space skills transfer learning across different frontier models?
- How does treating cognition as computation reshape education and work?
- Why does pre-training provide the raw material for emergent thinking?
- How much does training data format influence reasoning strategy versus domain content?
- Why do students learn better from explanations than from solving problems from scratch?
- What mechanisms activate latent reasoning capabilities already present in base models?
- How does training data structure shape reasoning strategy more than domain content?
- Is reasoning failure caused by task complexity or training distribution gaps?
- Can structured workflows unlock latent reasoning abilities that raw models don't show?
- What makes knowledge seeding equivalent to hippocampal replay in the brain?
- Why does in-weight memorization fail compared to tool-based fact access?
- Can articulating latent reasoning processes improve transfer across domains?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- Can minimal training signals unlock latent reasoning capability in base models?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
- How does representational density emerge from training data familiarity?
- What makes procedural knowledge in documents generalize better than facts?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- Can small demonstration sets unlock general reasoning without large question data?
- How does question difficulty and breadth affect what models learn to reason?
- What makes factual memorization less efficient than tool-based retrieval?
Related concepts in this collection (5)
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  architectural explanation for the data-level finding: procedural knowledge lives in higher layers
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
  consistent: small amounts of diverse procedural demonstration catalyze reasoning
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  procedural knowledge from pretraining IS the latent capability that minimal signals unlock
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  tension: procedural knowledge may be a form of heuristic rather than genuine reasoning
- Can text-trained models compress images better than specialized tools?
  Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
  procedural knowledge compresses better than factual knowledge (one procedure covers many instances), directly explaining why compression = generalization is more powerful for reasoning than for factual recall
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Eliciting Reasoning in Language Models with Cognitive Tools
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Original note title
procedural knowledge in pretraining documents drives reasoning generalization unlike factual retrieval which requires document-specific memorization