Does optimizing for model confidence actually improve both performance and calibration simultaneously?
This note explores whether using a model's own confidence as the training target genuinely lifts both accuracy and calibration at once, or whether that is a trade-off in disguise.
The question is whether "performance" and "calibration" (does the model's stated confidence match how often it is right?) can be optimized together, or whether pushing one quietly breaks the other. The corpus says the answer flips depending on *what* you reward. The most striking result is the counter-case: plain binary correctness rewards, where a right answer earns a point and a wrong one earns zero, actively wreck calibration. Nothing punishes a confidently wrong answer, so the model learns to bluff [Does binary reward training hurt model calibration?]. So "optimize for the answer" and "optimize for honest confidence" are not automatically the same goal.
The encouraging news is that the trade-off isn't fundamental. Adding a proper scoring rule (the Brier score) as a second reward term mathematically guarantees you can raise accuracy *and* calibration with no tension between them [Does binary reward training hurt model calibration?]. And confidence itself can be the engine rather than the casualty: RLSF ranks reasoning traces by the model's own answer-span confidence, which strengthens step-by-step reasoning while *reversing* the calibration damage that standard RLHF causes, with no human labels or external graders needed [Can model confidence work as a reward signal for reasoning?]. Related work goes further and treats confidence as a replacement for the external verifier: the model's intrinsic probability of a correct answer drives reinforcement learning in domains where you have no answer key [Can model confidence alone replace external answer verification?].
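To make the scoring-rule point concrete, here is a minimal Python sketch. It assumes one simple way of combining the two terms: the reward is the correctness indicator minus the squared gap between the stated confidence and that indicator. The cited note says only that the Brier score is added as a second reward term, so this exact form is an assumption, and the function names `binary_reward`, `brier_augmented_reward`, and `expected_reward` are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """1 for a right answer, 0 for a wrong one; stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the squared gap between confidence and outcome.

    Assumed form for illustration, not taken from the cited work.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected augmented reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))


if __name__ == "__main__":
    p = 0.6  # hypothetical true chance that the answer is right
    grid = [i / 100 for i in range(101)]
    # Which stated confidence maximizes the expected reward?
    best = max(grid, key=lambda q: expected_reward(p, q))
    print("best stated confidence:", best)
    # A confidently wrong answer is now penalized rather than scored as zero.
    print("confident and wrong:", brier_augmented_reward(False, 1.0))
```

Under the binary reward, the stated confidence never enters the score, so bluffing costs nothing. Under the assumed combined reward, the expected value for a fixed answer peaks when the stated confidence equals the true probability of being right, which is the defining property of a proper scoring rule.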
But the corpus also warns that confidence is a noisy instrument, and *how* you read it matters. Averaging confidence across a whole reasoning trace hides local breakdowns. Step-level confidence catches the moment the reasoning derails and even lets you stop early, which beats global averaging [Does step-level confidence outperform global averaging for trace filtering?]. So "optimize for confidence" isn't one knob; coarse confidence and fine-grained confidence behave differently.
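The difference between the two readings can be sketched in a few lines of Python. This is an illustrative toy, not the cited method: it assumes each reasoning step already has a confidence score in [0, 1], and the sliding-window size, the threshold, and the function names are all hypothetical choices.

```python
from statistics import mean
from typing import Optional, Sequence


def global_confidence(step_conf: Sequence[float]) -> float:
    """Coarse reading: one average over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf: Sequence[float], window: int = 3) -> float:
    """Fine-grained reading: the weakest sliding-window average in the trace."""
    if not step_conf:
        raise ValueError("trace has no steps")
    window = min(window, len(step_conf))
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def early_stop_index(step_conf: Sequence[float], threshold: float,
                     window: int = 3) -> Optional[int]:
    """Index of the first step whose trailing window average falls below threshold.

    Returns None if no window falls below it (or the trace is shorter than window).
    """
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1
    return None


if __name__ == "__main__":
    # Hypothetical trace: confident, then a local breakdown, then confident again.
    trace = [0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9]
    print("global average:", global_confidence(trace))
    print("weakest window:", lowest_window_confidence(trace))
    print("stop at step:", early_stop_index(trace, threshold=0.5))
```

In this made-up trace the global average stays fairly high even though three consecutive steps collapse, while the windowed reading exposes the collapse and the early-stop check fires partway through it, before the remaining steps are generated.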
The deeper caution is that calibration may not be a single axis you can simply tune up. One paper shows that a model's failure direction is baked in by its training objective: reasoning-trained models *under*-abstain and over-answer because abstaining earns no reward, while safety-trained models do the opposite and refuse harmless questions [Does training objective determine which direction models fail at abstention?]. Calibration, in this view, is a characteristic signature of what you rewarded, not a free-floating dial. And confidence can be high *and wrong* in ways no amount of confidence-optimization will fix: when a model confidently hallucinates an unseen entity combination, pretraining-data statistics flag the risk better than the model's own confidence ever could, because they catch the cause rather than the symptom [Can pretraining data statistics detect hallucinations better than model confidence?].
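The data-side idea can be illustrated with a toy co-occurrence check in Python. This is a simplified stand-in, not QuCo-RAG's actual implementation: it assumes entities have already been extracted from each training document, and `build_cooccurrence`, `should_retrieve`, and the `min_count` threshold are hypothetical names and parameters.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Set


def build_cooccurrence(docs: Iterable[Set[str]]) -> Counter:
    """Count how often each unordered entity pair appears in the same document."""
    counts: Counter = Counter()
    for entities in docs:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


def should_retrieve(entities: Iterable[str], counts: Counter,
                    min_count: int = 1) -> bool:
    """Flag a query when any entity pair was seen fewer than min_count times.

    The model's own confidence is deliberately not consulted here.
    """
    pairs = combinations(sorted(set(entities)), 2)
    # Counter returns 0 for pairs it has never seen.
    return any(counts[pair] < min_count for pair in pairs)


if __name__ == "__main__":
    # Hypothetical, hand-made "training corpus" of extracted entity sets.
    corpus = [
        {"Marie Curie", "Sorbonne"},
        {"Marie Curie", "radium"},
    ]
    counts = build_cooccurrence(corpus)
    print(should_retrieve(["Marie Curie", "radium"], counts))  # pair was seen
    print(should_retrieve(["Sorbonne", "radium"], counts))     # pair never seen
```

The check fires on the unseen pair regardless of how confident the model is, which is the sense in which data statistics target the cause (an unseen combination) rather than the symptom (low confidence).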
The thing you might not have expected to learn: yes, confidence can improve performance and calibration together — but only when the reward is *shaped* so that confident-and-wrong is penalized (a proper scoring rule), and when confidence is read at the right granularity. Reward raw correctness and you teach bluffing; reward confidence naively and you inherit whatever blind spots the model already had. The simultaneous win is real, but it's an engineering property of the objective, not a free lunch.
Sources (6 notes)
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Reasoning-trained models under-abstain and over-answer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals that calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).