Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

The Decoupling Knowledge and Reasoning paper proposes a testable two-phase model of LLM inference by contrasting fast thinking (no chain-of-thought) with slow thinking (CoT-enabled). Fast thinking engages Phase 1 only: knowledge retrieval from lower network layers. Slow thinking adds Phase 2: reasoning adjustment in higher layers. Comparing the two isolates each phase's contribution.
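The comparison can be made concrete with a small sketch. It assumes, as a simplification rather than the paper's exact metric, that accuracy without CoT stands in for the Phase 1 knowledge score and that the change in accuracy once CoT is enabled stands in for the Phase 2 reasoning adjustment. The names `PhaseSplit` and `decouple` and all result lists are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PhaseSplit:
    knowledge: float             # Phase 1 proxy: accuracy without chain-of-thought
    reasoning_adjustment: float  # Phase 2 proxy: accuracy change when CoT is enabled


def decouple(fast_correct: list[bool], slow_correct: list[bool]) -> PhaseSplit:
    """Split per-question results into a knowledge proxy and a reasoning proxy.

    Assumption (illustrative only): both lists cover the same questions,
    answered without CoT (fast) and with CoT (slow) respectively.
    """
    if len(fast_correct) != len(slow_correct) or not fast_correct:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(fast_correct)
    fast_acc = sum(fast_correct) / n
    slow_acc = sum(slow_correct) / n
    return PhaseSplit(knowledge=fast_acc, reasoning_adjustment=slow_acc - fast_acc)


# Made-up results for illustration; these are not numbers from the paper.
math_split = decouple([True, False, False, False], [True, True, True, False])
medical_split = decouple([True, True, True, False], [True, True, False, False])
```

In this toy data the math split has a positive reasoning adjustment and the medical split a negative one, which is the shape of the pattern the note goes on to describe.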

Across 15 LLMs and 3 datasets, the paper reports three findings:

Domain-specificity of reasoning benefit: Phase 2 (reasoning adjustment) helps in math, physics, and chemistry but can impair performance in knowledge-intensive domains. In medical tasks, the knowledge retrieved in Phase 1 may be more reliable than the Phase 2 reasoning applied on top of it, so reasoning adjustment introduces error rather than correcting it.

Scaling asymmetry: parameter scaling improves both phases, but knowledge improvement (Phase 1) dominates. Larger models know more, and this knowledge advantage outpaces the reasoning advantage. Scaling makes models more "prudent" (better at not making errors) across all domains, but only "more intelligent" (better at novel inference) in reasoning-intensive ones.

Layer localization: knowledge retrieval is primarily a lower-layer phenomenon, while reasoning adjustment operates in higher layers. This is a functional architectural separation, not just a behavioral one.

The layer localization provides the mechanistic explanation for the SFT knowledge gap. CoT fine-tuning and RLVR modify higher-layer behavior. They cannot improve the lower-layer knowledge encoding that knowledge-intensive tasks depend on. Adding reasoning training to a model that lacks medical knowledge won't close the knowledge gap, because it modifies layers that are not the bottleneck.

Architectural evidence for layer redundancy: "The Unreasonable Ineffectiveness of the Deeper Layers" (2403.17887) provides striking corroboration. Up to half of an LLM's layers can be pruned with minimal degradation on question-answering benchmarks, using a simple strategy: identify the optimal block of layers to prune by cross-layer similarity, then heal the damage with QLoRA finetuning on a single A100 GPU. This implies either that current pretraining methods are not properly leveraging the parameters in deeper layers, or that shallow layers play a disproportionately critical role in storing knowledge. Both interpretations reinforce the functional separation: if knowledge resides in lower layers, the deeper layers' contribution may be primarily redundant refinement rather than essential computation.
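The block-selection step can be sketched as follows. This is a minimal illustration, assuming that cross-layer similarity is measured as the angular distance between the representation entering a block and the representation leaving it. The data layout, the function names, and the toy vectors are hypothetical, and the QLoRA healing step is not shown.

```python
from __future__ import annotations

import math


def angular_distance(u: list[float], v: list[float]) -> float:
    """Angular distance in [0, 1]: arccos of the cosine similarity, divided by pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cos) / math.pi


def best_block_to_prune(layer_inputs: list[list[float]], n: int) -> int:
    """Return the start index of the n-layer block that changes the representation least.

    layer_inputs[i] is the representation entering layer i (assumed layout);
    the last entry is the final output, so there are len(layer_inputs) - 1 layers.
    """
    num_layers = len(layer_inputs) - 1
    if not 0 < n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    candidates = range(num_layers - n + 1)
    return min(
        candidates,
        key=lambda i: angular_distance(layer_inputs[i], layer_inputs[i + n]),
    )


# Toy 2-D representations for a 4-layer model. The angles (degrees) are made up
# so that later layers rotate the vector less than earlier ones.
angles = [0, 40, 70, 75, 80]
toy_inputs = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles]
start = best_block_to_prune(toy_inputs, n=2)
```

With these toy vectors the selected block starts at layer 2, the deepest candidate. A real pipeline would average the distance over many inputs before choosing a block.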

Retrieval heads as mechanistic evidence: The "Retrieval Head" paper provides direct causal evidence for specialization. A sparse set of attention heads (<5%) is responsible for retrieving relevant information from long context. These retrieval heads are: (1) universal across model families; (2) intrinsic, in that they exist in short-context models and persist through context-length extension; (3) dynamically activated, with some always attending to the required information while others activate contextually; and (4) causal, in that pruning them causes hallucination while pruning non-retrieval heads does not affect retrieval ability. Retrieval heads strongly influence CoT reasoning (which requires referring back to prior context) but minimally affect tasks where the model generates from intrinsic knowledge. This is a specific mechanistic instantiation of the lower-layer knowledge retrieval function described above. See: What mechanism enables models to retrieve from long context?
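One way to picture how a head could be scored for this copy-from-context behavior is sketched below. It assumes a simplified score: the fraction of needle tokens that the model generates while the head's strongest attention sits on that same token inside the needle. The function name, the half-open span convention, and the attention rows are hypothetical, and each row is assumed to cover only the context positions.

```python
from __future__ import annotations


def retrieval_score(
    context_tokens: list[str],
    needle_span: tuple[int, int],
    decode_steps: list[tuple[str, list[float]]],
) -> float:
    """Fraction of distinct needle tokens that one head copied from the context.

    needle_span is a half-open (start, end) range into context_tokens.
    Each decode step pairs the generated token with this head's attention
    weights over the context positions at that step.
    """
    start, end = needle_span
    needle = set(context_tokens[start:end])
    if not needle:
        raise ValueError("needle span is empty")
    copied = set()
    for generated, attention in decode_steps:
        top = max(range(len(attention)), key=attention.__getitem__)
        # Count a copy only if the head looks into the needle at the very
        # token that is being generated.
        if start <= top < end and context_tokens[top] == generated:
            copied.add(generated)
    return len(copied) / len(needle)


# Made-up example: the needle is the two tokens "4" and "7".
context = ["the", "code", "is", "4", "7", "today"]
partial_head = [
    ("4", [0.10, 0.10, 0.10, 0.60, 0.05, 0.05]),  # attends to "4" while emitting "4"
    ("7", [0.50, 0.10, 0.10, 0.10, 0.10, 0.10]),  # attends elsewhere
]
full_head = [
    ("4", [0.05, 0.05, 0.05, 0.70, 0.10, 0.05]),
    ("7", [0.05, 0.05, 0.05, 0.10, 0.70, 0.05]),
]
partial_score = retrieval_score(context, (3, 5), partial_head)
full_score = retrieval_score(context, (3, 5), full_head)
```

Under this simplified score, a head that rarely attends to the token being copied scores near zero, which is how a small subset of heads could be singled out for the pruning comparison described above.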

Latent concept hierarchy: "Discovering Latent Concepts Learned in BERT" (2205.07237) confirms the layer hierarchy from a representation perspective. Lower layers dominate in learning shallow lexical concepts, while higher layers learn semantic relations. Critically, BERT learns novel concepts (e.g., animal categories, demographic groups) that do not adhere to predefined categorizations; the model discovers its own organizational structure. Several latent concepts are based on multiple properties spanning semantics, syntax, and morphology simultaneously, suggesting the layer separation is not clean but follows a general gradient.

The "Procedural Knowledge in Pretraining Drives Reasoning" paper provides the data-level explanation that complements this architectural finding. By ranking 5 million pretraining documents by their influence on model completions, the authors show that reasoning draws on a diffuse set of documents containing procedural knowledge (descriptions of how to solve a problem), while factual recall draws on narrow document sets containing the target fact. This maps directly onto the layer separation: lower layers store memorized facts (requiring document-specific exposure), while higher layers encode procedural strategies (learnable from general demonstrations of method). See: Does procedural knowledge drive reasoning more than factual retrieval?

Related concepts in this collection 4

Does medical AI need knowledge or reasoning more? Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?
layer localization is the mechanistic explanation for the behavioral pattern this note documents
Why doesn't mathematical reasoning transfer to medicine? Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.
transfer fails because SFT modifies higher-layer reasoning while the bottleneck is lower-layer knowledge; this paper makes that precise
Do language models actually use their encoded knowledge? Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
layer localization explains the encoding-generation gap: knowledge in lower layers may be overridden by higher-layer reasoning adjustments that introduce error, producing the failure mode where the model "knows" the answer but generates an incorrect one
Can text-trained models compress images better than specialized tools? Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
the compression framing maps onto the layer separation: lower layers compress facts (document-specific memorization), higher layers compress procedures (generalizable reasoning); the scaling caveat on adjusted compression may reflect redundancy in deeper layers
