SYNTHESIS NOTE

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This note explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

Synthesis note · 2026-02-21 · sourced from Domain Specialization

The assumption behind porting reasoning-capable models to specialized domains is that reasoning ability transfers — that a model trained to reason well about mathematics can be steered toward medical reasoning through fine-tuning. The Knowledge or Reasoning paper falsifies this assumption with a specific mechanism.

R1-distilled models — fine-tuned variants of strong base models specifically trained to produce long reasoning chains — do not outperform base models on medical benchmarks when evaluated with domain-specific metrics (KI/InfoGain). The general reasoning capabilities that make R1-distilled models effective on mathematical tasks do not transfer to the medical domain via either SFT or RL. The limiting factor is domain knowledge, not reasoning architecture.

The mechanism is clarified by the KI/InfoGain framework. In medical tasks, knowledge accuracy (KI) correlates more strongly with final accuracy than reasoning step informativeness (InfoGain) across four of five benchmarks. Mathematical reasoning has the inverse pattern: reasoning quality matters more than factual knowledge retrieval. These are different competency regimes. A model optimized for one regime cannot import its advantages to the other.
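The comparison described above can be sketched as a simple correlation check: given per-benchmark (or per-sample) KI scores, InfoGain scores, and correctness, see which metric tracks accuracy more closely. This is only an illustrative sketch, not the paper's evaluation code. The function names `pearson` and `dominant_factor` are hypothetical, the choice of Pearson correlation is an assumption, and the numbers are toy values invented for the example rather than reported results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def dominant_factor(ki, info_gain, correct):
    """Report which metric correlates more strongly with correctness.

    Returns a (label, r_ki, r_info_gain) tuple, where label is
    "knowledge" or "reasoning". Hypothetical helper for illustration.
    """
    r_ki = pearson(ki, correct)
    r_ig = pearson(info_gain, correct)
    label = "knowledge" if abs(r_ki) > abs(r_ig) else "reasoning"
    return label, r_ki, r_ig


# Toy values (NOT from the paper): six hypothetical medical questions.
toy_ki = [0.9, 0.8, 0.3, 0.2, 0.85, 0.25]         # knowledge accuracy per answer
toy_info_gain = [0.5, 0.4, 0.45, 0.35, 0.55, 0.45]  # reasoning informativeness
toy_correct = [1, 1, 0, 0, 1, 0]                  # 1 = final answer correct

label, r_ki, r_ig = dominant_factor(toy_ki, toy_info_gain, toy_correct)
print(f"dominant factor: {label} (r_KI={r_ki:.2f}, r_InfoGain={r_ig:.2f})")
```

In a knowledge-dominated regime like the one the note describes for medicine, the KI correlation would come out larger; in a reasoning-dominated regime such as math, the InfoGain correlation would be expected to lead instead.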

This is distinct from "Can non-reasoning models catch up with more compute?", which concerns inference-compute regime differences within the same training framework. That finding says the gap cannot be closed by adding more inference-time compute. This finding says it cannot be closed by fine-tuning either: the gap lies in the underlying domain knowledge, which fine-tuning on the wrong type of reasoning traces cannot supply.

The practical implication for domain AI deployment: a strong general reasoning model is not a substitute for domain-specific training data. In knowledge-intensive domains, the ceiling is what the model knows, not how it reasons. Systems that assume general reasoning strength translates to domain-specific reliability will be overconfident about their actual performance. The related note "Does supervised fine-tuning actually improve reasoning quality?" adds that even when SFT improves accuracy on domain tasks, reasoning quality may degrade, compounding the problem.

Related concepts in this collection 5

Can non-reasoning models catch up with more compute?
Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
compute-regime gap; this is knowledge-regime gap (different mechanisms)

Does medical AI need knowledge or reasoning more?
Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?
why transfer fails: the two domains require different model investments

Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
SFT improves accuracy but doesn't solve the underlying knowledge problem

Does RL improve domain reasoning by adding knowledge or removing it?
When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
RL corrects reasoning paths but can't substitute for missing domain knowledge

Can models learn reasoning from predicting any text?
Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
extends: even Quiet-STaR's token-level general reasoning is bounded by training corpus diversity; this note explains the harder ceiling once deployment hits knowledge-intensive domains

Original note title

general reasoning does not transfer to knowledge-intensive domains via sft due to domain knowledge gaps
