SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This note explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

Synthesis note · 2026-02-21 · sourced from Domain Specialization
How do you build domain expertise into general AI models? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The assumption behind porting reasoning-capable models to specialized domains is that reasoning ability transfers: a model trained to reason well about mathematics can be steered toward medical reasoning through fine-tuning. The "Knowledge or Reasoning" paper contradicts this assumption and identifies a specific mechanism for the failure.

R1-distilled models (fine-tuned variants of strong base models, trained specifically to produce long reasoning chains) do not outperform base models on medical benchmarks when evaluated with the domain-specific KI and InfoGain metrics. The general reasoning capabilities that make R1-distilled models effective on mathematical tasks do not transfer to the medical domain via either SFT or RL. The limiting factor is domain knowledge, not reasoning ability.

The mechanism is clarified by the KI/InfoGain framework. In medical tasks, knowledge accuracy (KI) correlates more strongly with final accuracy than reasoning step informativeness (InfoGain) across four of five benchmarks. Mathematical reasoning has the inverse pattern: reasoning quality matters more than factual knowledge retrieval. These are different competency regimes. A model optimized for one regime cannot import its advantages to the other.
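The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's evaluation code: it assumes that a KI score, an InfoGain score, and a final accuracy are available for each evaluated model on a benchmark, and it uses plain Pearson correlation as a stand-in for whatever statistic the paper actually reports. The function names `pearson` and `dominant_factor` are hypothetical, and the numbers are synthetic placeholders, not reported results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def dominant_factor(ki, info_gain, accuracy):
    """Report which metric tracks final accuracy more closely.

    Hypothetical helper: returns ("knowledge" | "reasoning", r_ki, r_ig),
    comparing the raw correlation of each metric with accuracy.
    """
    r_ki = pearson(ki, accuracy)
    r_ig = pearson(info_gain, accuracy)
    label = "knowledge" if r_ki > r_ig else "reasoning"
    return label, r_ki, r_ig


# Synthetic placeholder scores for four hypothetical models on one benchmark.
# These values are made up for illustration and are NOT from the paper.
ki_scores = [0.2, 0.4, 0.6, 0.8]
info_gain_scores = [0.5, 0.3, 0.6, 0.4]
accuracies = [0.3, 0.45, 0.6, 0.8]

label, r_ki, r_ig = dominant_factor(ki_scores, info_gain_scores, accuracies)
print(f"dominant factor: {label} (r_KI={r_ki:.3f}, r_InfoGain={r_ig:.3f})")
```

Running the same comparison per benchmark would show, under these assumptions, whether a domain sits in the knowledge-dominated regime (KI tracks accuracy more closely) or the reasoning-dominated regime (InfoGain tracks accuracy more closely).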

This is distinct from the note "Can non-reasoning models catch up with more compute?", which is about inference-compute regime differences within the same training framework. That finding says you can't close the gap by adding more inference-time compute. This finding says you can't close the gap by fine-tuning either: the gap is in the underlying domain knowledge, which fine-tuning on the wrong type of reasoning traces cannot supply.

The practical implication for domain AI deployment: a strong general reasoning model is not a substitute for domain-specific training data. In knowledge-intensive domains, the ceiling is what the model knows, not how it reasons. Systems that assume general reasoning strength translates to domain-specific reliability will be overconfident about their actual performance. The note "Does supervised fine-tuning actually improve reasoning quality?" adds that even when SFT improves accuracy on domain tasks, reasoning quality may degrade, which compounds the problem.

Original note title

general reasoning does not transfer to knowledge-intensive domains via sft due to domain knowledge gaps