Does supervised fine-tuning improve reasoning or just answers?
Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
Post angle for Medium / LinkedIn
Hook: "Every AI benchmark measures accuracy. What if accuracy is exactly the wrong thing to measure when deploying AI in high-stakes domains?"
The finding: The "Knowledge or Reasoning?" paper introduces two metrics: Knowledge Index (KI), the factual correctness of each reasoning step, and Information Gain (InfoGain), how much each reasoning step reduces uncertainty about the final answer. Applied to SFT-trained models on medical and mathematical tasks, these metrics show SFT raising final-answer accuracy while cutting InfoGain by 38.9%. Models get more answers right while their reasoning steps contribute less information toward those answers.
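To make the two metrics concrete, here is a minimal Python sketch built only from the one-line descriptions above. It is an illustration under stated assumptions, not the paper's implementation, and the paper's exact definitions may differ. The function names, the `answer_probs` input (a model's probability of the gold answer after each reasoning step), and the per-step correctness labels are all hypothetical.

```python
import math


def info_gain(answer_probs):
    """Mean per-step drop in uncertainty about the gold answer.

    answer_probs[i] is an assumed probability of the gold answer after
    the first i reasoning steps (index 0 is the question alone).
    Every value must lie in (0, 1].
    """
    surprisal = [-math.log2(p) for p in answer_probs]
    gains = [surprisal[i] - surprisal[i + 1] for i in range(len(surprisal) - 1)]
    return sum(gains) / len(gains) if gains else 0.0


def knowledge_index(step_is_correct):
    """Fraction of reasoning steps judged factually correct.

    step_is_correct holds one boolean per step, assumed to come from
    checking that step against an external reference.
    """
    if not step_is_correct:
        return 0.0
    return sum(step_is_correct) / len(step_is_correct)


# Hypothetical numbers for illustration, not results from the paper.
stepwise = [0.25, 0.5, 1.0]  # each step sharpens belief in the answer
shortcut = [0.9, 0.9, 0.9]   # answer already settled before any reasoning

print(info_gain(stepwise))
print(info_gain(shortcut))
print(knowledge_index([True, True, False, True]))
```

In this toy setup the shortcut trace can still end on the right answer, yet its steps contribute no information: the pattern the note describes as higher accuracy with lower InfoGain.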
The mechanism: SFT rewards answers, not reasoning paths. The training data consists of question-answer pairs, and the loss function anchors on the correct final output. Models learn the most efficient path to the right answer within the training distribution, often domain-specific shortcuts, pattern matches, and frequency-weighted heuristics that produce the correct answer without the inferential chain that would justify it. The reasoning in the output becomes post-hoc rationalization.
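The point that the loss anchors on the final output can be sketched with a toy objective. This assumes a common SFT setup in which the loss is the mean negative log-likelihood over supervised target tokens only. The function name, the mask layout, and the log-probability values are hypothetical, and real pipelines compute this over framework tensors rather than plain floats.

```python
def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over supervised tokens only.

    token_logprobs: assumed per-token log-probabilities from the model.
    loss_mask: 1 where the token belongs to the target answer, else 0.
    """
    supervised = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not supervised:
        raise ValueError("loss_mask selects no tokens")
    return -sum(supervised) / len(supervised)


# Two hypothetical sequences: different log-probs on the unsupervised
# positions, identical log-probs on the answer tokens.
mask = [0, 0, 1, 1]
run_a = [-5.0, -3.0, -0.5, -1.5]
run_b = [-0.1, -0.2, -0.5, -1.5]

print(sft_loss(run_a, mask))
print(sft_loss(run_b, mask))
```

Both runs receive the same loss because only the answer positions are scored. Under this assumed objective, nothing distinguishes an answer reached through an inferential chain from one reached through a shortcut.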
Why this matters for deployment: High-stakes domains need more than correct answers; they need auditable reasoning. Medical decision support must show clinical logic. Legal AI must demonstrate how conclusions follow from statute and precedent. Financial AI must show how recommendations connect to market data and regulatory context. SFT improves the answer but may make the reasoning path less meaningful: verbose decoration around the correct output rather than the pathway that produced it.
The measurement problem: Standard benchmarks measure what's easy to measure: whether the final answer matches the ground truth. InfoGain and KI require decomposing reasoning chains and evaluating each step against external ground truth — expensive and difficult to automate at scale. So the measurement gap persists, and every organization that deploys based on benchmark accuracy is systematically blind to the reasoning quality regression.
The connection: This extends the existing cluster of overthinking findings into the training dimension. "Does extended thinking actually improve reasoning or just increase variance?" covers the inference-time cost. "Does reasoning fine-tuning make models worse at declining to answer?" covers a different training-time cost (calibration). The SFT accuracy trap is the third entry: a training-time cost to reasoning quality.
Platform: Medium (1000–1400 words). Could lead with the FALM / medical AI deployment angle, then introduce the measurement framework.
Inquiring lines that use this note as a source (93)
This note is a source for these synthesized inquiries.
- Can debugging skills be validated if AI training degraded them first?
- How does AI substitute polished style for actual expert judgment?
- Why do persuasive AI techniques also reduce factual accuracy?
- Does evaluating AI output require different cognitive skills than solving problems directly?
- How should we redesign benchmarks to catch conservative bias in reasoning tasks?
- What training methods make models more persuasive but less factually accurate?
- How does AI assistance differ from search engines in cognitive impact?
- Could AI assessment quality differ across subjects or question formats?
- Why do single examples trigger large reasoning improvements in models?
- How does optimizing for accuracy during training degrade downstream reasoning quality?
- Does training for better reasoning reduce an AI system's ability to abstain?
- How does critique fine-tuning on one problem unlock broader reasoning?
- Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?
- Can organized response format trick users into overestimating AI reliability?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Does training on critiques of noisy responses produce deeper understanding than imitating correct ones?
- Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?
- Can verifier-guided search catch factual errors that reasoning training cannot?
- Does domain training degrade reasoning ability even when benchmark scores rise?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Why does fine-tuning sometimes damage chain-of-thought reasoning even when accuracy improves?
- Can AI evaluation tools solve the verification problem they help create?
- Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- Does fine-tuning models for specific tasks destroy their ability to reason?
- How much reasoning catalyst data is actually needed for improvement?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- How do contrasting examples improve AI feedback quality over generic suggestions?
- What makes training data quality more important than quantity for reasoning?
- Can preference optimization training make models worse at detecting false presuppositions?
- How does fine-tuning on natural language inference affect fallacy susceptibility?
- Can critic model trios evaluate reasoning quality more reliably than outcome rewards alone?
- Can attribute-specific preference optimization improve question quality in information-seeking?
- Do models trained for reasoning lose their ability to decline questions?
- Can question quality be trained separately from the decision to ask?
- What happens when reasoning fine-tuning eliminates model refusal mechanisms entirely?
- Does formal reasoning training actively degrade social reasoning ability?
- Can training improve reasoning coherence without improving actual correctness?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- Does reasoning fine-tuning actually damage a model's ability to abstain?
- Can AI evaluation match human judgment quality in structured domain tasks?
- Does SFT degrade reasoning quality while improving domain accuracy?
- Does reinforcement learning preserve reasoning quality better than supervised fine-tuning?
- What training cost tradeoffs exist between fine-tuning and other knowledge injection methods?
- Why does increasing reasoning not improve AI social reasoning performance?
- What training approach enables models to proactively request clarification?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- What alternatives to RLHF better preserve truth-seeking in AI outputs?
- What makes a first answer so often the best answer a model produces?
- What role could knowledge custodians play in validating AI output?
- Why do benchmark scores rise while reasoning quality declines?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Why does automated evaluation consistently overestimate research quality?
- Does unrestricted reasoning per search step degrade iterative quality over time?
- How can one training example improve reasoning across thousands of unseen problems?
- Why do AI benchmarks measure accuracy instead of reasoning quality?
- Can models be trained to explain instead of imitate answers?
- What makes reasoning auditable in medical AI decision support?
- Can we improve reasoning by amplifying information at mutual information peaks?
- Can attribute decomposition improve other interactive reasoning tasks beyond clinical questioning?
- Why does training data format shape reasoning strategy more than content?
- Can adversarial critics force genuine reasoning the same way critique fine-tuning does?
- Why does reasoning training improve math but hurt knowledge tasks?
- Does supervised fine-tuning improve reasoning or just response formatting?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- Can evaluation trajectories and interaction histories replace single-answer scoring?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- Why do explicit quality criteria outperform learning quality from examples alone?
- Does fine-tuning push models toward reasoning shortcuts that bypass the chain entirely?
- Can models maintain reasoning-output coupling while improving domain accuracy?
- Can personalized AI learning systems actually widen rather than narrow educational gaps?
- Can thought quality alone be trusted to guide model training?
- Why does reasoning fine-tuning suppress the confidence signals that adaptive retrieval needs?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- What quality filters distinguish useful reasoning enrichment from shallow repetition?
- Does decoupling reasoning from tool use actually improve accuracy?
- Can structured questioning prompts improve reasoning beyond standard conversational training?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- How does supervised fine-tuning degrade chain-of-thought faithfulness over time?
- Why does reasoning fine-tuning reduce models' ability to abstain?
- What makes some training data teach brittle answers versus robust reasoning?
- Why does document perplexity stay low while question-answering accuracy drops?
- How does preference learning differ from supervised finetuning for reasoning?
- What other agent behaviors besides citations reveal reasoning quality?
- Can small demonstration sets unlock general reasoning without large question data?
- How does question difficulty and breadth affect what models learn to reason?
Related concepts in this collection (10)
- Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
the underlying insight this post dramatizes
- Does reasoning fine-tuning make models worse at declining to answer?
When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
parallel SFT cost: calibration vs. reasoning quality
- Does extended thinking actually improve reasoning or just increase variance?
When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
inference-time version of the same accuracy vs. quality trade-off
- Does critiquing errors teach deeper understanding than imitating correct answers?
Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
counter-strategy: CFT addresses the SFT accuracy trap by replacing correct-answer imitation with structured failure analysis as the training objective
- Why do better reasoning models ignore instructions?
As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?
a third dimension of SFT/RL training cost: SFT degrades reasoning quality (this note), reasoning training degrades instruction adherence (instruction-following deficit), and both reflect the same pattern — optimizing one capability structurally degrades another
- Can language models solve ToM benchmarks without real reasoning?
Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
ToM benchmarks are a concrete case of the SFT accuracy trap: SFT achieves competitive ToM scores without reasoning training, suggesting benchmarks reward structural pattern exploitation rather than genuine mental state reasoning
- Why does SFT-then-RL training follow a predictable three-phase pattern?
When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
temporal dynamics: CHORD reveals the SFT accuracy trap as the first phase of a three-phase progression; RL can recover from SFT's reasoning degradation but only if SFT and RL are integrated as a continuous spectrum rather than hard-sequenced stages
- Does preference optimization harm conversational understanding?
Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
parallel training-induced degradation: SFT degrades reasoning quality (this note) while RLHF degrades conversational grounding; both demonstrate that optimizing for what benchmarks and raters measure structurally erodes capabilities that require different evaluation frameworks
- Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
FER provides the representational diagnosis: SFT may produce fractured internal representations that yield correct answers through pattern-matching shortcuts while the underlying structure is broken in ways standard benchmarks cannot detect
- Does fine-tuning disconnect reasoning steps from final answers?
When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
a second dimension of SFT damage beyond InfoGain: fine-tuning reduces how much reasoning steps causally influence the final answer, making the chain performative rather than functional; together with InfoGain degradation, SFT damages both reasoning quality and reasoning faithfulness
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Learning to Reason for Factuality
- Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Eliciting Reasoning in Language Models with Cognitive Tools
- Post-Completion Learning for Language Models
- OpenThoughts: Data Recipes for Reasoning Models
Original note title
the sft accuracy trap — training raises benchmark scores while degrading reasoning quality