Does reasoning fine-tuning make models worse at declining to answer?
When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
AbstentionBench measures something most reasoning benchmarks ignore: whether models know when not to answer. The finding is that fine-tuning for reasoning performance — the process that produces o1, R1, and similar models — degrades abstention capacity by approximately 24%.
Models optimized for reasoning say "I don't know" less often. They answer when they should decline. They express confidence when uncertainty is appropriate.
This is not a paradox once you understand the training dynamics. Reasoning fine-tuning rewards chains that produce answers. The reward signal is: generate a complete, confident response to the question. This is the right signal for most reasoning tasks. But it systematically punishes abstention — "I don't know" is the one output that terminates the chain without a scorable answer.
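The incentive described above can be made concrete with a toy calculation. The sketch below is illustrative only: `expected_reward`, `best_action`, and the reward values are hypothetical names and numbers, not the reward function of any actual training pipeline. It assumes a simple outcome-based reward in which a correct answer scores 1 and both wrong answers and abstentions score 0.

```python
def expected_reward(p_correct: float, abstain: bool,
                    r_correct: float = 1.0, r_wrong: float = 0.0,
                    r_abstain: float = 0.0) -> float:
    """Expected reward for one question under a toy outcome-based reward.

    All reward values are illustrative assumptions, not taken from any paper.
    """
    if abstain:
        return r_abstain
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct: float, **reward_kwargs) -> str:
    """Return the action with the higher expected reward."""
    answer = expected_reward(p_correct, abstain=False, **reward_kwargs)
    decline = expected_reward(p_correct, abstain=True, **reward_kwargs)
    return "answer" if answer > decline else "abstain"


# With no penalty for wrong answers, guessing beats abstaining
# even when the model is almost certainly wrong.
print(best_action(0.05))                 # answer
# Penalizing wrong answers makes abstention the better choice at low confidence.
print(best_action(0.05, r_wrong=-1.0))   # abstain
print(best_action(0.6, r_wrong=-1.0))    # answer
```

Under these assumptions, any nonzero chance of being right makes answering strictly better than declining, which is one way to read why abstention erodes when wrong answers carry no cost.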
The result: models that are better at reasoning when they know the answer and worse at recognizing when they don't.
This adds a dimension to the note "Does more thinking time actually improve LLM reasoning?". The overthinking finding showed that extended token-level thinking degrades performance above a threshold. AbstentionBench shows a different form of the same problem: training at the fine-tuning stage shifts the model toward answering regardless of whether answering is appropriate. The trade-off operates at two different timescales, inference time and training time, but both involve the same directional failure: more reasoning commitment produces worse calibration.
The deployment implication is severe. Systems that rely on reasoning models to flag uncertainty — medical decision support, legal research, financial analysis — are working with tools that have been specifically optimized to not flag uncertainty. The capability improvement is real; the calibration regression is equally real and mostly invisible in standard benchmarks.
The "Hallucination Tax of Reinforcement Finetuning" paper quantifies an even more extreme version: reinforcement fine-tuning (RFT) reduces model refusal rates by over 80%, meaning models that previously declined correctly now generate plausible-sounding but fabricated responses. The proposed mitigation mixes SUM (Synthetic Unanswerable Math), a dataset of unanswerable math problems, into the RFT data; including roughly 10% SUM substantially restores appropriate refusal behavior with minimal accuracy trade-off on solvable tasks. But the need for a dedicated mitigation underscores the severity: reasoning fine-tuning doesn't just reduce abstention by 24%; it can nearly eliminate the refusal mechanism entirely, replacing "I don't know" with confidently wrong answers.
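The mitigation described above amounts to blending a share of abstention-targeted examples into the fine-tuning data. The following is a minimal sketch of that kind of mixing, not the paper's actual pipeline: `mix_datasets`, the list-of-dicts example format, and the default `abstention_fraction` of 0.1 are all assumptions made for illustration.

```python
import random


def mix_datasets(reasoning_examples, abstention_examples,
                 abstention_fraction=0.1, seed=0):
    """Blend abstention-targeted examples into a reasoning training set.

    Hypothetical helper: aims for `abstention_fraction` of the final mix to
    come from `abstention_examples`, capped by how many are available.
    """
    if not 0.0 <= abstention_fraction < 1.0:
        raise ValueError("abstention_fraction must be in [0, 1)")
    rng = random.Random(seed)
    pool = list(abstention_examples)
    n_reasoning = len(reasoning_examples)
    # Solve n_abst / (n_reasoning + n_abst) = fraction for n_abst.
    n_abstention = round(n_reasoning * abstention_fraction
                         / (1.0 - abstention_fraction))
    n_abstention = min(n_abstention, len(pool))
    mixed = list(reasoning_examples) + rng.sample(pool, n_abstention)
    rng.shuffle(mixed)
    return mixed


# Toy data: answerable problems have a target answer,
# unanswerable ones have a target that declines.
solvable = [{"question": f"q{i}", "target": "42"} for i in range(90)]
unanswerable = [{"question": f"u{i}", "target": "I don't know"}
                for i in range(50)]

mixed = mix_datasets(solvable, unanswerable, abstention_fraction=0.1)
n_decline = sum(ex["target"] == "I don't know" for ex in mixed)
print(len(mixed), n_decline)
```

Fixing the fraction relative to the final mix, rather than relative to the reasoning set, keeps the share of abstention examples stable as the reasoning set grows.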
The legal overruling benchmark (Domain Specialization source) adds a counterpoint: base models (Gemini-Flash, GPT-4 class) that are not reasoning-optimized show the opposite failure in complex legal tasks. They abstain too much, setting their uncertainty threshold too high and failing to make correct judgments even when the evidence is available. GPT-5 (a reasoning-optimized model) showed a notably lower abstention rate on the same task, consistent with the under-abstention pattern documented here. Calibration failure appears bidirectional: reasoning-optimized models over-answer; non-reasoning-optimized models over-abstain. See the related tension: "Does training objective determine which direction models fail at abstention?"
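Because the failure runs in both directions, a single abstention rate cannot tell over-answering apart from over-abstaining. The sketch below reports the two error rates separately. It assumes each evaluated item is labeled with whether abstention was the correct behavior; `Outcome` and `abstention_error_rates` are hypothetical names, not part of AbstentionBench or any other benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    should_abstain: bool  # ground truth: the question is unanswerable
    did_abstain: bool     # observed: the model declined to answer


def abstention_error_rates(outcomes):
    """Return over-answer and over-abstain rates as separate numbers."""
    unanswerable = [o for o in outcomes if o.should_abstain]
    answerable = [o for o in outcomes if not o.should_abstain]
    # Over-answering: the model answered a question it should have declined.
    over_answer = (sum(not o.did_abstain for o in unanswerable)
                   / len(unanswerable)) if unanswerable else 0.0
    # Over-abstaining: the model declined a question it could have answered.
    over_abstain = (sum(o.did_abstain for o in answerable)
                    / len(answerable)) if answerable else 0.0
    return {"over_answer_rate": over_answer,
            "over_abstain_rate": over_abstain}


# Toy results: 4 unanswerable items (3 answered anyway),
# 4 answerable items (1 declined).
results = (
    [Outcome(True, False)] * 3 + [Outcome(True, True)]
    + [Outcome(False, True)] + [Outcome(False, False)] * 3
)
print(abstention_error_rates(results))
```

On this toy data the over-answer rate is 0.75 and the over-abstain rate is 0.25, the profile the note attributes to reasoning-optimized models; the reverse profile would indicate over-abstention.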
Inquiring lines that use this note as a source 12
This note is a source for these synthesized inquiries.
- Does training for better reasoning reduce an AI system's ability to abstain?
- Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?
- Does fine-tuning models for specific tasks destroy their ability to reason?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- What training signals would teach models when not to reason?
- Do models trained for reasoning lose their ability to decline questions?
- What happens when reasoning fine-tuning eliminates model refusal mechanisms entirely?
- Does reasoning fine-tuning actually damage a model's ability to abstain?
- How do reasoning improvements suppress a model's ability to abstain?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Why does reasoning fine-tuning reduce models' ability to abstain?
Related concepts in this collection 5
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  inference-time overthinking; this adds the training-time calibration cost
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  token-level degradation; this is training-level degradation
- Why do language models fail confidently in specialized domains?
  LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?
  related overconfidence pattern in different domain
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  third SFT cost: reasoning quality (InfoGain), alongside abstention calibration documented here
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  abstention degradation compounds the underspecification problem: reasoning training both suppresses the ability to disengage (this note) and fails to develop the diagnostic ability to identify what information is missing (that note)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
- TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
Original note title
reasoning fine-tuning degrades abstention capacity by 24 percent revealing a hidden cost of the reasoning-performance trade-off