Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?
This explores why training a model to reason harder makes it worse at saying 'I don't know' — and what that 24% drop reveals about a deeper trade-off baked into reasoning fine-tuning.
The headline result is direct: models optimized for reasoning performance answer more often, express unwarranted confidence, and fail to abstain when they should, which amounts to a roughly 24% drop in abstention capacity. The cause lies in the training signal, not in some exotic failure. Reasoning fine-tuning rewards complete, confident answers and systematically punishes 'I don't know' responses, so the model learns that abstaining is the losing move (see "Does reasoning fine-tuning make models worse at declining to answer?"). The 24% isn't a bug in the optimizer; it's the optimizer working exactly as told.
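The incentive can be made concrete with a toy calculation. The sketch below is not taken from the cited work: the function names and the reward values (1 for a correct answer, 0 or -1 for a wrong one, 0 for abstaining) are illustrative assumptions, chosen only to show how an accuracy-only grader makes answering the better move at any confidence level.

```python
def expected_rewards(p_correct, r_correct=1.0, r_wrong=0.0, r_abstain=0.0):
    """Expected reward of answering vs. abstaining under a toy grader.

    All reward values are illustrative assumptions, not figures from the
    work discussed above.
    """
    answer = p_correct * r_correct + (1.0 - p_correct) * r_wrong
    return {"answer": answer, "abstain": r_abstain}


def best_action(p_correct, **rewards):
    """Return the action with the higher expected reward (ties go to abstain)."""
    scores = expected_rewards(p_correct, **rewards)
    return "answer" if scores["answer"] > scores["abstain"] else "abstain"


if __name__ == "__main__":
    for p in (0.05, 0.3, 0.9):
        accuracy_only = best_action(p)            # wrong answers cost nothing
        penalized = best_action(p, r_wrong=-1.0)  # wrong answers are penalized
        print(f"p(correct)={p:.2f}  accuracy-only: {accuracy_only:8s}  penalized: {penalized}")
```

Under the accuracy-only grader, answering wins for any nonzero chance of being right, so a reward-maximizing policy never abstains. With the assumed penalty of -1 for a wrong answer, answering only pays when the chance of being correct exceeds 0.5, and abstention becomes the better move below that.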
What makes this interesting is that the corpus suggests it's one symptom of a broader pattern: reasoning fine-tuning tends to optimize the *appearance* of good reasoning while quietly hollowing out its substance. One study found that supervised fine-tuning raises benchmark accuracy while cutting 'Information Gain' by 38.9%: models reach correct answers through post-hoc rationalization rather than genuine inferential steps, and standard metrics miss it because they only check the final answer (see "Does supervised fine-tuning improve reasoning or just answers?"). A parallel line of work shows that fine-tuning weakens the causal link between the reasoning chain and the answer: you can truncate, paraphrase, or stuff filler into the reasoning and the answer often doesn't change, meaning the reasoning has become performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). Abstention collapse fits this family. The model is rewarded for looking decisive, and confident-looking decisiveness is the opposite of calibrated abstention.
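The perturbation idea behind those faithfulness tests can be sketched in a few lines. This is a minimal illustration, not the cited study's protocol: `AnswerFn`, `truncate`, `filler`, and the two toy models are hypothetical stand-ins, and paraphrasing is left out because it needs a rewriting model (it could be supplied as one more entry in the `perturbations` dictionary).

```python
from typing import Callable, Dict, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the reasoning."""
    return steps[: len(steps) // 2]


def filler(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with contentless tokens."""
    return ["..." for _ in steps]


def invariance(answer_fn: AnswerFn, question: str, steps: List[str],
               perturbations: Dict[str, Perturbation]) -> Dict[str, bool]:
    """For each perturbation, report whether the answer stayed the same."""
    baseline = answer_fn(question, steps)
    return {name: answer_fn(question, perturb(steps)) == baseline
            for name, perturb in perturbations.items()}


# Toy stand-ins for a model (assumptions for illustration only).
def performative_model(question: str, steps: List[str]) -> str:
    return "42"  # ignores the reasoning entirely


def functional_model(question: str, steps: List[str]) -> str:
    # Reads the answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


if __name__ == "__main__":
    steps = ["3 + 4 = 7", "7 * 6 = 42"]
    tests = {"truncate": truncate, "filler": filler}
    print(invariance(performative_model, "What is (3 + 4) * 6?", steps, tests))
    print(invariance(functional_model, "What is (3 + 4) * 6?", steps, tests))
```

An answer that survives every perturbation, as with the toy performative model, is the signature of reasoning that does not drive the output. Averaging these booleans over a dataset would give an invariance rate that can be compared before and after fine-tuning.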
The deeper culprit appears to be calibration damage from preference-style training. RLHF-style optimization is known to degrade a model's calibration, its sense of when it is actually likely to be right, and once calibration is broken the model can't tell which questions deserve an 'I don't know.' Notably, one approach reverses this: using the model's own answer-span confidence as the reward signal restores calibration *while* improving reasoning, which suggests the abstention problem isn't intrinsic to reasoning training but depends on *what you reward* (see "Can model confidence work as a reward signal for reasoning?"). Reward completeness and confidence, and you lose abstention; reward calibrated confidence, and you keep both. The 24% is a choice of training target, not a law of nature.
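The source note for this approach describes ranking reasoning traces by answer-span confidence to create synthetic preferences. A rough sketch of that ranking step follows. The details are assumptions rather than the paper's method: the `Trace` type is hypothetical, confidence is taken to be the geometric-mean probability of the answer tokens, and the pair is simply the most and least confident trace.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    """One sampled reasoning trace (hypothetical container)."""
    text: str
    answer_logprobs: List[float]  # log-probs of the answer-span tokens only


def span_confidence(trace: Trace) -> float:
    """Geometric-mean probability of the answer tokens (an assumed metric)."""
    if not trace.answer_logprobs:
        raise ValueError("trace has no answer-span tokens")
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pair(traces: List[Trace]) -> Tuple[Trace, Trace]:
    """Return (chosen, rejected): the most and least confident traces."""
    if len(traces) < 2:
        raise ValueError("need at least two traces to form a preference")
    ranked = sorted(traces, key=span_confidence, reverse=True)
    return ranked[0], ranked[-1]


if __name__ == "__main__":
    samples = [
        Trace("trace A ... answer: 12", [-0.1, -0.2]),
        Trace("trace B ... answer: 15", [-1.5, -2.0]),
        Trace("trace C ... answer: 12", [-0.7]),
    ]
    chosen, rejected = preference_pair(samples)
    print("chosen:  ", chosen.text, round(span_confidence(chosen), 3))
    print("rejected:", rejected.text, round(span_confidence(rejected), 3))
```

Pairs built this way could feed a standard preference-optimization step. Only the answer span is scored, so a long or verbose reasoning chain does not by itself raise or lower a trace's rank.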
There's a useful cross-current here too. A recurring finding in the collection is that post-training doesn't create reasoning capability. It selects and deploys latent capability already present in the base model, teaching the model *when* to reason rather than *how* (see "Do base models already contain hidden reasoning ability?" and "Does RL post-training create reasoning or just deploy it?"). Read against the abstention result, this reframes the whole problem: if fine-tuning mostly shapes deployment behavior, then degraded abstention is a deployment-policy distortion, in which the model has been taught a policy of 'always answer', and not evidence that its underlying knowledge of its own ignorance was erased. That's hopeful, because deployment policies are far easier to re-tune than lost capabilities.
If you want to go deeper, the SFT accuracy trap and the faithfulness-degradation work are the two doorways that best explain *why* the answer-completeness reward backfires, and the confidence-as-reward paper is the one that shows the most concrete path out — restoring calibration without giving up reasoning gains.
Sources (6 notes)
Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.