Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?

This explores why training a model to reason harder makes it worse at saying 'I don't know' — and what that 24% drop reveals about a deeper trade-off baked into reasoning fine-tuning.

The headline result is direct: models optimized for reasoning performance answer more often, express unwarranted confidence, and fail to abstain when they should, a drop of roughly 24% in abstention capacity. The cause lies in the training signal, not in some exotic failure. Reasoning fine-tuning rewards producing a complete, confident answer and systematically punishes 'I don't know' responses, so the model learns that abstaining is the losing move (see: Does reasoning fine-tuning make models worse at declining to answer?). The 24% isn't a bug in the optimizer; it's the optimizer working exactly as told.
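
To make the "losing move" point concrete, here is a toy expected-reward calculation. It is only a sketch: the reward values and the function names `expected_reward` and `best_action` are illustrative assumptions, not the reward scheme used in any of the cited studies. It shows that when a wrong answer costs nothing, answering is never worse than abstaining, however unlikely the model is to be right.

```python
def expected_reward(p_correct, r_correct=1.0, r_wrong=0.0, r_abstain=0.0):
    """Return (expected reward for answering, reward for abstaining).

    All reward values are hypothetical defaults chosen for illustration.
    """
    answer = p_correct * r_correct + (1.0 - p_correct) * r_wrong
    return answer, r_abstain


def best_action(p_correct, **rewards):
    """Pick the action with the higher expected reward."""
    answer, abstain = expected_reward(p_correct, **rewards)
    # Ties go to answering: nothing in the reward favors abstention.
    return "answer" if answer >= abstain else "abstain"


if __name__ == "__main__":
    for p in (0.05, 0.3, 0.6):
        accuracy_only = best_action(p)            # wrong answers cost nothing
        penalized = best_action(p, r_wrong=-1.0)  # wrong answers are penalized
        print(p, accuracy_only, penalized)
```

Under the accuracy-only reward, the best action is "answer" even at a 5% chance of being right. With a penalty of -1 for wrong answers (again, an assumed value), abstaining wins whenever the chance of being right falls below 50%.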

What makes this interesting is that the corpus suggests it's one symptom of a broader pattern: reasoning fine-tuning tends to optimize the *appearance* of good reasoning while quietly hollowing out its substance. One study found that supervised fine-tuning raises benchmark accuracy while cutting 'Information Gain' by 38.9%: models reach correct answers through post-hoc rationalization rather than genuine inferential steps, and standard metrics miss it because they only check the final answer (see: Does supervised fine-tuning improve reasoning or just answers?). A parallel line of work shows that fine-tuning weakens the causal link between the reasoning chain and the answer: you can truncate, paraphrase, or stuff filler into the reasoning and the answer often doesn't change, meaning the reasoning has become performative rather than functional (see: Does fine-tuning disconnect reasoning steps from final answers?). Abstention collapse fits this family: the model is rewarded for looking decisive, and confident-looking decisiveness is the opposite of calibrated abstention.
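
The perturbation idea can be sketched in a few lines. This is a minimal illustration, not the protocol from the cited work: the `AnswerFn` interface, the two stub "models", and the toy items are all hypothetical, and the paraphrasing test is left out because it needs a separate paraphraser. The measure is how often the final answer stays the same after the reasoning is damaged; a high rate suggests the reasoning is not doing the work.

```python
from typing import Callable, Dict, List

# Hypothetical interface: maps (question, reasoning text) to a final answer.
AnswerFn = Callable[[str, str], str]


def truncate(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first part of the reasoning chain."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler(reasoning: str) -> str:
    """Filler substitution: replace every step with content-free tokens."""
    return "\n".join("..." for _ in reasoning.split("\n"))


def invariance_rate(model: AnswerFn, items: List[Dict[str, str]],
                    perturb: Callable[[str], str]) -> float:
    """Fraction of items whose answer is unchanged after perturbation."""
    unchanged = 0
    for item in items:
        original = model(item["question"], item["reasoning"])
        perturbed = model(item["question"], perturb(item["reasoning"]))
        unchanged += original == perturbed
    return unchanged / len(items)


# Two stub "models" (assumptions for the demo, not real LLM calls).
def ignores_reasoning(question: str, reasoning: str) -> str:
    return question.upper()  # answer depends on the question only


def uses_reasoning(question: str, reasoning: str) -> str:
    return reasoning.split("\n")[-1]  # answer is read off the last step


ITEMS = [
    {"question": "2+3?", "reasoning": "2 plus 3\nequals 5\n5"},
    {"question": "10-4?", "reasoning": "10 minus 4\nequals 6\n6"},
]

if __name__ == "__main__":
    for model in (ignores_reasoning, uses_reasoning):
        print(model.__name__,
              invariance_rate(model, ITEMS, truncate),
              invariance_rate(model, ITEMS, filler))
```

The stub that ignores its reasoning scores 1.0 on both tests, while the stub that reads its answer from the last step scores 0.0. Real models fall somewhere in between, and the claim above is that fine-tuning pushes them toward the first kind.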

The deeper culprit appears to be calibration damage from preference-style training. RLHF-style optimization is known to degrade a model's calibration, its sense of when it's actually likely to be right, and once calibration is broken, the model can't tell which questions deserve an 'I don't know.' Notably, one approach reverses this: using the model's own answer-span confidence as the reward signal restores calibration *while* improving reasoning, suggesting the abstention problem isn't intrinsic to reasoning training but to *what you reward* (see: Can model confidence work as a reward signal for reasoning?). Reward completeness and confidence, and you lose abstention; reward calibrated confidence, and you keep both. The 24% is a choice of training target, not a law of nature.
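
For readers who want a handle on what "calibration" measures here, one common summary is expected calibration error (ECE): group predictions by stated confidence and compare each group's average confidence with its actual accuracy. The sketch below is a plain-Python version of one standard binned form. The sample numbers are made up for illustration, and nothing here is taken from the cited papers' evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean gap between stated confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Made-up data: a model that claims 90% confidence but is right 1 in 4.
    overconfident = expected_calibration_error(
        [0.9, 0.9, 0.9, 0.9], [True, False, False, False])
    # Made-up data: a model that claims 75% and is right 3 in 4.
    calibrated = expected_calibration_error(
        [0.75, 0.75, 0.75, 0.75], [True, True, True, False])
    print(overconfident, calibrated)
```

The link to abstention is that a simple "abstain below a confidence threshold" rule only works if confidence tracks accuracy. For the overconfident example, where the gap is about 0.65, no threshold separates the questions the model should decline from the ones it should answer.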

There's a useful cross-current here too. A recurring finding in the collection is that post-training doesn't create reasoning capability; it selects and deploys latent capability already present in the base model, teaching the model *when* to reason rather than *how* (see: Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?). Read against the abstention result, this reframes the whole problem: if fine-tuning mostly shapes deployment behavior, then degraded abstention is a deployment-policy distortion (the model has been taught a policy of 'always answer'), not evidence that its underlying knowledge of its own ignorance was erased. That's hopeful, because deployment policies are far easier to re-tune than lost capabilities.

If you want to go deeper, the SFT accuracy trap and the faithfulness-degradation work are the two doorways that best explain *why* the answer-completeness reward backfires, and the confidence-as-reward paper is the one that shows the most concrete path out — restoring calibration without giving up reasoning gains.

Sources 6 notes

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining a claim about reasoning fine-tuning and abstention capacity. The question remains open: Why does reasoning fine-tuning reduce model abstention capacity?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. The library reports:
- Reasoning fine-tuning reduces abstention by ~24% because training rewards complete, confident answers and penalizes 'I don't know' responses (2024–2025).
- Supervised fine-tuning raises benchmark accuracy while cutting Information Gain by 38.9%, suggesting post-hoc rationalization rather than genuine inference (2024–2025).
- Fine-tuning weakens the causal link between reasoning chains and answers; truncating or paraphrasing the chain often doesn't change the answer, indicating performative rather than functional reasoning (2024–2025).
- RLHF-style optimization degrades calibration (the model's sense of when it's right); using the model's own answer-span confidence as reward signal restores calibration *while* preserving reasoning gains (2025).
- Post-training deploys latent reasoning already in the base model, teaching *when* to reason rather than *how*; this reframes abstention loss as deployment-policy distortion, not capability erasure (2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2411.15382 (Nov 2024) – Impact of fine-tuning on chain-of-thought reasoning
- arXiv:2506.09038 (Jun 2025) – AbstentionBench: reasoning LLMs fail on unanswerable questions
- arXiv:2507.21931 (Jul 2025) – Post-training via reinforcement learning from self-feedback
- arXiv:2512.07783 (Dec 2025) – Interplay of pre-training, mid-training, and RL on reasoning models

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 24% abstention drop, calibration damage, and Information Gain loss: has recent work (last 6 months) shown that newer reasoning models, scaling, curriculum methods, or calibration-aware reward designs have *relaxed* any of these? Separately: is the underlying durable question (can we train reasoning without sacrificing calibrated abstention?) still open, or has a concrete solution emerged? Name what resolved it if so; flag where the constraint still holds.
(2) Surface the strongest *contradicting* or *superseding* work. Does any recent paper show abstention capacity *preserved* or *improved* under reasoning fine-tuning? Or does any work undermine the 24% figure or the calibration-damage narrative?
(3) Propose 2 research questions that assume the training regime may have evolved: e.g., do newer RL curricula that explicitly weight abstention succeed? Can constitutional or constraint-based post-training preserve both reasoning and calibration?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
