Why does reasoning fine-tuning suppress the confidence signals that adaptive retrieval needs?
This explores a hidden conflict between two training goals: fine-tuning a model to reason or answer well tends to wreck the calibrated uncertainty signal that 'retrieve only when unsure' systems depend on to decide when to look things up.
This question sits at the collision point of two ideas the corpus treats separately. On one side, the most efficient adaptive-retrieval systems don't use elaborate heuristics at all: they read the model's own calibrated token-probability uncertainty and retrieve when confidence dips, which beats multi-call methods on single-hop tasks and matches them on multi-hop tasks while using a fraction of the LM and retriever calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). The whole approach rests on one assumption: the model's confidence is an honest readout of whether it actually knows. On the other side, fine-tuning quietly violates that assumption.
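A minimal sketch of such a gate, under stated assumptions: the confidence score here is the geometric-mean token probability of a drafted answer, and the names `mean_token_confidence`, `should_retrieve`, and the 0.7 threshold are hypothetical choices for illustration. The cited note does not specify the estimator, so this is one plausible reading rather than the method it describes.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a drafted answer, in (0, 1].

    Assumption: averaging log-probabilities is one simple way to turn
    per-token scores into a single confidence value.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.7) -> bool:
    """Fire the retrieval gate when confidence falls below the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return mean_token_confidence(token_logprobs) < threshold


# Illustrative log-probabilities only, not measurements from any model.
confident_draft = [math.log(0.95), math.log(0.90), math.log(0.92)]
hesitant_draft = [math.log(0.60), math.log(0.30), math.log(0.50)]

print(should_retrieve(confident_draft))  # gate stays closed
print(should_retrieve(hesitant_draft))   # gate opens, retrieval happens
```

The gate is only as good as the score it reads: if training inflates token probabilities on answers the model does not actually know, `should_retrieve` returns `False` for exactly the cases that need retrieval.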
The damage shows up in what fine-tuning optimizes for. Supervised fine-tuning raises final-answer accuracy on benchmarks while cutting Information Gain, a measure of the genuine inferential content of the reasoning, by 38.9%: the model learns to produce correct answers through post-hoc rationalization rather than working them out (see "Does supervised fine-tuning improve reasoning or just answers?"). Faithfulness tests sharpen the point: after fine-tuning, reasoning chains less reliably cause the final answer at all, becoming performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). If the reasoning no longer drives the answer, then the confidence attached to that answer is no longer tracking a real inferential process; it is tracking how well the model has learned to look right. That is precisely the signal adaptive retrieval is trying to read, now corrupted.
Reward-based training names the mechanism more directly. RLHF is shown to actively degrade calibration: the model's stated confidence drifts away from its actual accuracy (see "Can model confidence work as a reward signal for reasoning?"). When you train a model to maximize a correctness or preference signal, you push its probability mass toward confident-looking outputs regardless of whether it should be uncertain. The retrieval gate that depended on seeing a low-confidence dip stops firing, because the model has been trained out of expressing doubt.
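To make "confidence drifts away from accuracy" concrete, the sketch below computes expected calibration error (ECE), a standard binned calibration metric, and counts how often a threshold gate would fire. The data are invented for illustration, and `gate_fire_count` with its 0.7 threshold is a hypothetical helper; nothing here reproduces results from the cited notes.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def gate_fire_count(confidences: Sequence[float], threshold: float = 0.7) -> int:
    """How many answers would trigger retrieval under a simple threshold gate."""
    return sum(1 for conf in confidences if conf < threshold)


# Same ten answers (6 correct, 4 wrong) under two hypothetical models.
correct = [True, True, True, True, False, True, True, False, False, False]
calibrated = [0.8] * 5 + [0.4] * 5   # confidence matches accuracy in each group
overconfident = [0.95] * 10          # uniformly high, regardless of correctness

print(expected_calibration_error(calibrated, correct), gate_fire_count(calibrated))
print(expected_calibration_error(overconfident, correct), gate_fire_count(overconfident))
```

In this toy setup the calibrated scores give an ECE of about 0 and fire the gate on the five shaky answers, while the overconfident scores give an ECE of about 0.35 and never fire it, even though accuracy is identical in both cases.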
The corpus also points at the fix, which doubles as confirmation of the cause. RLSF reverses the calibration damage by making confidence itself the training target: it uses answer-span confidence to rank reasoning traces, which restores calibration while still improving reasoning (see "Can model confidence work as a reward signal for reasoning?"). The fact that you can repair confidence by optimizing for it suggests the standard objectives were silently optimizing against it. There's a deeper reason confidence is worth protecting: it isn't noise. Model confidence directly predicts robustness: highly confident models resist prompt rephrasing while low-confidence ones swing wildly (see "Does model confidence predict robustness to prompt changes?"). Confidence is a genuine internal signal of stability, which is exactly why flattening it through fine-tuning is so costly.
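The ranking idea can be sketched as follows. This is an illustrative reading of "use answer-span confidence to rank reasoning traces and derive synthetic preferences", not the RLSF implementation: the `Trace` type, the geometric-mean confidence score, the all-pairs construction, and the `min_gap` margin are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Trace:
    """A sampled reasoning trace (hypothetical container for this sketch)."""
    text: str
    answer_logprobs: Sequence[float]  # log-probs of the answer-span tokens only


def answer_span_confidence(trace: Trace) -> float:
    """Geometric-mean probability over the answer span (assumed scoring rule)."""
    if not trace.answer_logprobs:
        raise ValueError("trace has no answer-span tokens")
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def synthetic_preferences(
    traces: List[Trace], min_gap: float = 0.1
) -> List[Tuple[Trace, Trace]]:
    """Return (preferred, rejected) pairs ranked by answer-span confidence.

    Pairs whose confidence gap is below min_gap are skipped, on the
    assumption that near-ties carry little preference signal.
    """
    ranked = sorted(traces, key=answer_span_confidence, reverse=True)
    pairs = []
    for i, better in enumerate(ranked):
        for worse in ranked[i + 1:]:
            gap = answer_span_confidence(better) - answer_span_confidence(worse)
            if gap >= min_gap:
                pairs.append((better, worse))
    return pairs


# Illustrative traces for one prompt; the log-probs are invented.
traces = [
    Trace("C", [math.log(0.40)]),
    Trace("A", [math.log(0.90)]),
    Trace("B", [math.log(0.85)]),
]
pairs = synthetic_preferences(traces)
print([(better.text, worse.text) for better, worse in pairs])
```

Pairs built this way could feed a standard preference-optimization step. No human labels or external verifiers are involved, which matches the property the source attributes to RLSF.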
The unexpected turn for a curious reader: the problem isn't that fine-tuned models know less. It's that they stop being able to tell you when they don't know. Related work suggests the better lever is training that rewards reasoning quality rather than token-level correctness: RL that internalizes coherent knowledge structures outperforms SFT precisely because it doesn't reduce everything to final-answer matching (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). Preserve the honesty of the uncertainty signal, and adaptive retrieval keeps working; optimize it away in pursuit of benchmark accuracy, and you blind the system to its own ignorance.
Sources (6 notes)
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.