Does reasoning fine-tuning actually harm a model's ability to abstain?
This explores whether training a model to reason better makes it worse at saying 'I don't know'. The corpus suggests the answer is yes: abstention is collateral damage from how reasoning rewards are structured.
Read narrowly, the question is about a specific, measurable side effect: does optimizing a model to reason its way to answers quietly erode its willingness to decline? The corpus has a direct hit here: reasoning fine-tuning degrades abstention capacity by roughly 24%, because the training signal rewards producing a complete answer and systematically punishes 'I don't know' [Does reasoning fine-tuning make models worse at declining to answer?]. Models come out answering more questions while expressing unwarranted confidence. So the harm is real, but the more interesting finding is *why*: abstention isn't being attacked directly; it's being starved by a reward that only counts finished answers.
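The incentive described above can be made concrete with a little expected-value arithmetic. The sketch below is illustrative only: the reward values (1 for a correct answer, 0 or -1 for a wrong one, 0 for abstaining) and the function names are assumptions chosen for the example, not figures from the notes.

```python
def expected_answer_reward(p_correct: float, r_correct: float, r_wrong: float) -> float:
    """Expected reward for committing to an answer that is right with probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def abstain_threshold(r_correct: float, r_wrong: float, r_abstain: float) -> float:
    """Confidence above which answering beats abstaining (assumes r_correct > r_wrong)."""
    return (r_abstain - r_wrong) / (r_correct - r_wrong)


def best_action(p_correct: float, r_correct: float = 1.0,
                r_wrong: float = 0.0, r_abstain: float = 0.0) -> str:
    """Pick the action with the higher expected reward; ties go to abstaining."""
    answer_value = expected_answer_reward(p_correct, r_correct, r_wrong)
    return "answer" if answer_value > r_abstain else "abstain"


# Hypothetical completion-only scoring: wrong answers and abstentions both earn 0.
completion_only = best_action(0.1)                 # guessing at 10% confidence still pays
# Hypothetical abstention-aware scoring: wrong answers cost -1.
abstention_aware = best_action(0.1, r_wrong=-1.0)  # now declining is the better move
```

Under the completion-only values the break-even confidence is 0, so a model never gains by declining; once wrong answers carry a cost, the break-even point moves to 0.5 and 'I don't know' becomes the better choice for low-confidence questions.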
What makes this more than an isolated result is that the same training dynamic shows up across the collection as a family of 'looks better, reasons worse' failures. Supervised fine-tuning raises benchmark accuracy while cutting the actual inferential quality of reasoning steps by nearly 39%, with correct answers arriving through post-hoc rationalization rather than genuine inference [Does supervised fine-tuning improve reasoning or just answers?]. Separately, fine-tuning loosens the causal link between a model's reasoning and its final answer: you can truncate, paraphrase, or insert filler into the chain and the answer barely changes, meaning the reasoning has become performative [Does fine-tuning disconnect reasoning steps from final answers?]. Abstention failure fits this pattern exactly: a model that has learned to always produce a confident-looking output is, almost by definition, one that has lost the off-ramp of declining.
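The perturbation idea behind those faithfulness tests can be sketched as an invariance rate: perturb the reasoning chain, re-ask for the final answer, and count how often the answer stays the same. This is a minimal sketch, not the notes' actual protocol. The `Answerer` interface, the two toy models, and the specific perturbations are hypothetical stand-ins, and paraphrasing is omitted because it would need a separate paraphrasing model.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
Answerer = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the chain."""
    return steps[: len(steps) // 2]


def add_filler(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with contentless tokens."""
    return ["..." for _ in steps]


def invariance_rate(answer_fn: Answerer, question: str, steps: List[str],
                    perturbations: List[Perturbation]) -> float:
    """Fraction of perturbations that leave the final answer unchanged."""
    baseline = answer_fn(question, steps)
    unchanged = sum(answer_fn(question, p(steps)) == baseline for p in perturbations)
    return unchanged / len(perturbations)


def faithful_model(question: str, steps: List[str]) -> str:
    """Toy model whose answer depends on its reasoning (reads off the last step)."""
    return steps[-1] if steps else "unknown"


def performative_model(question: str, steps: List[str]) -> str:
    """Toy model that ignores its reasoning entirely."""
    return "answer is 12"


chain = ["2 + 2 = 4", "4 * 3 = 12", "answer is 12"]
tests = [truncate, add_filler]
faithful_rate = invariance_rate(faithful_model, "What is (2 + 2) * 3?", chain, tests)
performative_rate = invariance_rate(performative_model, "What is (2 + 2) * 3?", chain, tests)
```

A high invariance rate is the warning sign: it means the chain can be damaged without consequence, which is what 'performative' reasoning looks like from the outside.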
The common thread is calibration. When a model is rewarded for completion, its confidence detaches from its actual correctness, and once that happens, abstaining (which requires knowing you don't know) becomes unreliable. The corpus offers a counterpoint that supports this diagnosis by reversing it: using the model's own answer-span confidence as the reward signal restores calibration *while* improving reasoning, undoing the calibration damage that standard RLHF introduces [Can model confidence work as a reward signal for reasoning?]. That's the tell: if changing the reward from 'complete the answer' to 'be well-calibrated' fixes it, then the abstention harm was a reward-design artifact, not an inherent cost of reasoning.
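As a rough illustration of how answer-span confidence could be turned into a training signal, the sketch below ranks sampled reasoning traces by confidence and emits preference pairs. The notes say only that confidence is used to rank traces into synthetic preferences. The aggregation (mean token probability over the answer span), the `margin` cutoff, and the dictionary layout are all assumptions made for this example.

```python
import math
from itertools import combinations
from typing import Dict, List


def span_confidence(answer_logprobs: List[float]) -> float:
    """Assumed aggregation: mean token probability over the answer span."""
    return sum(math.exp(lp) for lp in answer_logprobs) / len(answer_logprobs)


def preference_pairs(traces: List[Dict], margin: float = 0.05) -> List[Dict[str, str]]:
    """Build (chosen, rejected) pairs, preferring the trace with the more confident answer.

    Pairs whose confidence gap is below `margin` (a hypothetical cutoff) are skipped
    so that near-ties do not become noisy preference labels.
    """
    scored = [(span_confidence(t["answer_logprobs"]), t["trace"]) for t in traces]
    pairs = []
    for (conf_a, trace_a), (conf_b, trace_b) in combinations(scored, 2):
        if abs(conf_a - conf_b) < margin:
            continue
        chosen, rejected = (trace_a, trace_b) if conf_a > conf_b else (trace_b, trace_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Toy input: three sampled traces with log-probabilities of their answer tokens.
sampled = [
    {"trace": "A", "answer_logprobs": [math.log(0.9), math.log(0.9)]},
    {"trace": "B", "answer_logprobs": [math.log(0.5)]},
    {"trace": "C", "answer_logprobs": [math.log(0.88)]},
]
pairs = preference_pairs(sampled)
```

The resulting pairs could then feed any standard preference-optimization step. The point of the design is that no human labels or external verifier enter the loop, only the model's own confidence in its answer.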
There's a deeper reframe worth knowing about. Several notes argue that fine-tuning doesn't create reasoning at all: base models already hold latent reasoning ability, and post-training mostly selects *when* to deploy it rather than installing *how* [Do base models already contain hidden reasoning ability?] [Does RL post-training create reasoning or just deploy it?]. If post-training is largely about deployment timing, then 'when to abstain' is exactly the kind of routing decision it should be able to learn. Indeed, decoupled-RL approaches train models to route between thinking hard and answering quickly without mode collapse [Can models learn when to think versus respond quickly?]. The pessimistic readings reinforce why naive fine-tuning fails: RL often sharpens memorization and template-matching rather than installing real procedures [Do fine-tuned language models actually learn optimization procedures?], and better reasoning training doesn't even buy resistance to sycophantic pressure, because that's a generation-distribution problem, not a reasoning one [Can better reasoning training actually reduce model sycophancy?]. The takeaway you didn't know you wanted: abstention, faithfulness, and calibration all degrade together under completion-rewarding fine-tuning. They're three faces of the same broken incentive, and fixing the reward, not the reasoning, is what restores them.
Sources 9 notes
Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found GPT-4 was erroneously convinced 69% more often by fallacious arguments than by logically sound ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.