SYNTHESIS NOTE

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

The "Decoupling Knowledge and Reasoning" paper proposes a testable two-phase model of LLM inference by contrasting fast thinking (no chain-of-thought) with slow thinking (CoT-enabled). Fast thinking engages Phase 1 only: knowledge retrieval from lower network layers. Slow thinking adds Phase 2: reasoning adjustment in higher layers. Comparing the two isolates each phase's contribution.
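The fast-versus-slow comparison can be sketched as a simple accuracy difference per domain. This is a minimal illustration, not the paper's own metric: the function name `phase_contributions`, the record layout, and the toy records are all assumptions made for this example, and the numbers are invented rather than taken from the paper.

```python
from collections import defaultdict

def phase_contributions(records):
    """Split accuracy into a knowledge part and a reasoning part per domain.

    records: iterable of (domain, fast_correct, slow_correct), one per question,
    where fast_correct is the no-CoT result and slow_correct is the CoT result.
    Returns {domain: {"knowledge": fast accuracy,
                      "reasoning_gain": slow accuracy - fast accuracy}}.
    This accuracy-difference proxy is an assumption of this sketch.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [questions, fast hits, slow hits]
    for domain, fast_ok, slow_ok in records:
        entry = totals[domain]
        entry[0] += 1
        entry[1] += int(fast_ok)
        entry[2] += int(slow_ok)

    result = {}
    for domain, (n, fast_hits, slow_hits) in totals.items():
        fast_acc = fast_hits / n
        slow_acc = slow_hits / n
        result[domain] = {
            "knowledge": fast_acc,                  # Phase 1 only
            "reasoning_gain": slow_acc - fast_acc,  # what Phase 2 adds or removes
        }
    return result

# Hypothetical toy records, invented for illustration only.
toy_records = [
    ("math", False, True), ("math", True, True),
    ("math", False, True), ("math", False, False),
    ("medicine", True, True), ("medicine", True, False),
    ("medicine", True, True), ("medicine", False, False),
]
scores = phase_contributions(toy_records)
```

A negative `reasoning_gain` for a domain corresponds to the case described in this note where the reasoning phase overrides answers that retrieval alone would have gotten right.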

Across 15 LLMs and 3 datasets, the paper reports three findings:

Domain-specificity of reasoning benefit: Phase 2 (reasoning adjustment) helps math, physics, and chemistry but can impair performance on knowledge-intensive domains. In medical tasks, the knowledge retrieved in Phase 1 may be more reliable than the Phase 2 reasoning applied on top of it, so reasoning adjustment can introduce errors rather than correct them.

Scaling asymmetry: parameter scaling improves both phases, but the knowledge improvement (Phase 1) dominates. Larger models know more, and this knowledge advantage outpaces the reasoning advantage. Scaling makes models "more prudent" (better at avoiding errors) across all domains, but "more intelligent" (better at novel inference) only in reasoning-intensive ones.

Layer localization: knowledge retrieval is primarily a lower-layer phenomenon, while reasoning adjustment operates in higher layers. The separation is functional and architectural, not just behavioral.

Layer localization provides a mechanistic explanation for the SFT (supervised fine-tuning) knowledge gap. CoT fine-tuning and RLVR (reinforcement learning with verifiable rewards) modify higher-layer behavior. They cannot improve the lower-layer knowledge encoding that knowledge-intensive tasks depend on. Adding reasoning training to a model that lacks medical knowledge won't close the knowledge gap, because it modifies layers that are not the bottleneck.

Architectural evidence for layer redundancy: "The Unreasonable Ineffectiveness of the Deeper Layers" (2403.17887) provides striking corroboration. Up to half of an LLM's layers can be pruned with minimal degradation on question-answering benchmarks using a simple strategy: identify the optimal block of layers to prune by cross-layer similarity, then heal the model with QLoRA finetuning on a single A100 GPU. This implies either that current pretraining methods are not properly leveraging the parameters in deeper layers, or that shallow layers play a disproportionately critical role in storing knowledge. Both interpretations reinforce the functional separation: if knowledge resides in lower layers, the deeper layers' contribution may be primarily redundant refinement rather than essential computation.
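The block-selection step of that pruning strategy can be sketched as follows. This is a rough illustration under stated assumptions: it uses angular distance between the representation entering a block and the one leaving it as the similarity measure, works on one toy vector per layer boundary, and omits the QLoRA healing step entirely. The function names and the toy vectors are hypothetical.

```python
import math

def angular_distance(u, v):
    """Angular distance in [0, 1] between two vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos) / math.pi

def best_block_to_prune(hidden_states, n):
    """Pick the start of the n-layer block whose input and output differ least.

    hidden_states[l] is the representation entering layer l; the last entry is
    the output of the final layer. Returns (start_index, distance).
    """
    best_start, best_dist = None, float("inf")
    for start in range(len(hidden_states) - n):
        dist = angular_distance(hidden_states[start], hidden_states[start + n])
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Hypothetical 2-d representations at five layer boundaries (four layers).
toy_states = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
start_one, dist_one = best_block_to_prune(toy_states, n=1)
start_two, dist_two = best_block_to_prune(toy_states, n=2)
```

In the toy data, the layer between boundaries 2 and 3 leaves the representation unchanged, so it is the cheapest single layer to drop. With a real model the representations would come from actual forward passes over a dataset, which this sketch does not attempt.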

Retrieval heads as mechanistic evidence: The "Retrieval Head" paper provides direct causal evidence for layer specialization. A sparse set of attention heads (<5%) is responsible for retrieving relevant information from long context. These retrieval heads are: (1) universal across model families; (2) intrinsic, in that they exist in short-context models and persist through context-length extension; (3) dynamically activated, in that some always attend to the required information while others activate depending on context; and (4) causal, in that pruning them causes hallucination while pruning non-retrieval heads does not. Retrieval heads strongly influence CoT reasoning (which requires referring back to prior context) but minimally affect tasks where the model generates from intrinsic knowledge. This is a specific mechanistic instantiation of the lower-layer knowledge retrieval function described above. See "What mechanism enables models to retrieve from long context?"
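One way to make "responsible for retrieving" concrete is a copy-based score per head: how much of a planted needle the model reproduces while that head's strongest attention sits on the token being copied. The sketch below is a simplified reading of that idea; the exact definition and any threshold used in the paper may differ, and the function name, argument layout, and toy data are assumptions for illustration.

```python
def retrieval_scores(attn_argmax, generated, context, needle_positions):
    """Score each attention head by how much of the needle it helps copy.

    attn_argmax[t][h]: context position with head h's highest attention at
        decoding step t.
    generated[t]: token emitted at decoding step t.
    context: list of context tokens.
    needle_positions: set of context indices that make up the needle.
    A needle token counts for head h when the head's top-attended position is
    that token and the model emits the same token at that step.
    """
    num_heads = len(attn_argmax[0])
    copied = [set() for _ in range(num_heads)]
    for step, token in enumerate(generated):
        for head in range(num_heads):
            pos = attn_argmax[step][head]
            if pos in needle_positions and context[pos] == token:
                copied[head].add(pos)
    return [len(positions) / len(needle_positions) for positions in copied]

# Hypothetical toy trace: two heads, a three-token needle "7 4 2".
toy_context = ["the", "code", "is", "7", "4", "2", "today"]
toy_needle = {3, 4, 5}
toy_generated = ["7", "4", "2"]
toy_argmax = [[3, 0], [4, 6], [5, 4]]  # rows are steps, columns are heads
head_scores = retrieval_scores(toy_argmax, toy_generated, toy_context, toy_needle)
```

Here head 0 tracks the needle at every copy step and head 1 never does. Heads scoring above some chosen cutoff would be the candidates to ablate when testing the causal claim.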

Latent concept hierarchy: "Discovering Latent Concepts Learned in BERT" (2205.07237) confirms the layer hierarchy from a representation perspective. Lower layers dominate in learning shallow lexical concepts, while higher layers learn semantic relations. Critically, BERT learns novel concepts (e.g., animal categories, demographic groups) that do not adhere to predefined categorizations: the model discovers its own organizational structure. Several latent concepts are based on multiple properties spanning semantics, syntax, and morphology simultaneously, suggesting that the layer separation is not clean but follows a general gradient.

The "Procedural Knowledge in Pretraining Drives Reasoning" paper provides the data-level explanation that complements this architectural finding. By ranking 5 million pretraining documents by their influence on model completions, the authors show that reasoning draws on a diffuse set of documents containing procedural knowledge (descriptions of how to solve a problem), while factual recall draws on narrow document sets containing the target fact. This maps directly onto the layer separation: lower layers store memorized facts (requiring document-specific exposure), while higher layers encode procedural strategies (learnable from general demonstrations of method). See "Does procedural knowledge drive reasoning more than factual retrieval?"

Original note title

knowledge resides in lower network layers and reasoning in higher layers — this functional separation explains why reasoning training helps math but can impair knowledge-intensive domains