INQUIRING LINE

Why do improvements in accuracy come at the cost of calibration?

This explores why training that pushes a model to get more answers right often makes its confidence less trustworthy — the accuracy goes up while the model's sense of when it's actually right gets worse.


The corpus suggests the trade-off isn't accidental: it's baked into how we reward models. The clearest mechanism is the reward signal itself. When training only scores whether the final answer is correct, it never penalizes a confident wrong answer, so the model learns that high-confidence guessing is the optimal policy (see "Does binary reward training hurt model calibration?"). Accuracy and calibration come apart because the objective only ever measured one of them. Strikingly, the same note shows this is fixable: adding a proper scoring rule (the Brier score) as a second reward term makes the model optimize both at once, which means the trade-off is an artifact of an incomplete objective rather than a law of nature.
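To make the binary-reward mechanism concrete, here is a minimal sketch in Python. It assumes a reward of the form "correctness minus a weighted Brier penalty"; that exact combination, the function name `expected_reward`, and the 60% example are illustrative assumptions, not the formulation used in the cited note.

```python
def expected_reward(p_correct, confidence, brier_weight):
    """Expected reward for an answer that is right with probability p_correct
    when the model states `confidence` that it is right.

    Assumed reward (illustrative): correctness - brier_weight * Brier penalty.
    """
    # Binary term: 1 if correct, 0 otherwise. It does not depend on confidence.
    binary = p_correct
    # Brier penalty: squared gap between stated confidence and the 0/1 outcome.
    brier = p_correct * (1.0 - confidence) ** 2 + (1.0 - p_correct) * confidence ** 2
    return binary - brier_weight * brier


confidences = [i / 100 for i in range(101)]
p_correct = 0.6  # hypothetical: the model is right 60% of the time on this question type

# Binary-only reward: identical for every stated confidence.
binary_only = [expected_reward(p_correct, c, 0.0) for c in confidences]

# With the Brier term: the best stated confidence matches the true hit rate.
best_confidence = max(confidences, key=lambda c: expected_reward(p_correct, c, 1.0))
```

Under these assumptions the binary-only reward is flat in stated confidence, so nothing discourages claiming certainty, while the Brier term is maximized by reporting the true hit rate. That is the defining property of a proper scoring rule.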

The deeper pattern is that "accuracy" can rise even as the underlying reasoning rots. Supervised fine-tuning lifts final-answer accuracy while cutting reasoning informativeness by 38.9% on average. The model reaches right answers through pattern-matching shortcuts rather than genuine inference, so it becomes more correct and less auditable at the same time (see "Does supervised fine-tuning actually improve reasoning quality?"). A model that's right for shallow reasons has no good internal basis for knowing when it's wrong, which is exactly what miscalibration looks like.

What makes this dangerous is that aggregate accuracy actively hides the cost. In medical triage, legal interpretation, and financial planning, fluent confident errors concentrate in the rare, high-harm cases, and overall accuracy looks great precisely because those failures are statistically swamped by easy correct cases (see "Why do confident wrong answers hide in standard accuracy metrics?"). So optimizing for the headline number can quietly worsen the thing you'd most want calibrated: the model's hesitation on the cases where it shouldn't be sure.
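The swamping effect is easy to see with a small arithmetic sketch. The counts below are invented purely for illustration (they do not come from the cited note), and the slice names and the helper `accuracy_by_slice` are hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical evaluation set: 980 routine cases and 20 rare, high-harm cases.
records = (
    [("routine", True)] * 970
    + [("routine", False)] * 10
    + [("rare_high_harm", True)] * 5
    + [("rare_high_harm", False)] * 15
)

overall = sum(ok for _, ok in records) / len(records)
per_slice = accuracy_by_slice(records)

print(f"overall accuracy: {overall:.3f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.3f}")
```

With these made-up numbers the headline accuracy is 97.5% even though the model is right on only a quarter of the rare, high-harm cases. Reporting per-slice results next to the aggregate is one simple way to keep such failures visible.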

The corpus also reframes calibration as a *directional* failure, not a single dial. Reasoning-trained models under-abstain and over-answer because abstention earns no reward, while safety-trained models over-abstain and refuse benign questions: same broken calibration, opposite tilt, each inherited from whichever objective dominated training (see "Does training objective determine which direction models fail at abstention?"). This is the lateral key to the whole question: calibration is a fingerprint of what you rewarded, so any accuracy-maximizing objective that ignores confidence will leave its own signature distortion.

There's a quieter cousin worth knowing about. Asymmetric, utility-weighted losses correctly sharpen *decisions* but weaken representation learning, so training to act well can degrade what the model actually learns to represent. The fix is to learn with a symmetric loss, then adjust predictions afterward (see "Can utility-weighted training loss actually harm model performance?"). And once you're suspicious of confidence at all, note that even a model's apparent certainty is slippery: deterministic settings make outputs *consistent* without making them *reliable*, and repeating the same answer 100 times doesn't mean it's well-calibrated (see "Does setting temperature to zero actually make LLM outputs reliable?"). Taken together, the corpus's answer is that calibration is collateral: it suffers whenever the training target rewards being right without also rewarding knowing how right you are.
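One common way to "learn with a symmetric loss, then adjust predictions afterward" is to train an ordinary probabilistic classifier and apply the utility weighting only at decision time, by moving the threshold. The sketch below uses standard cost-sensitive thresholding; it is an assumption about what the post-hoc adjustment could look like, not the method described in the cited note, and the function names and cost values are hypothetical.

```python
def decision_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has lower expected cost than not acting.

    Acting on a case with probability p of being positive risks
    (1 - p) * cost_false_positive; not acting risks p * cost_false_negative.
    Setting the two equal gives the threshold below.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(probability, cost_false_positive, cost_false_negative):
    """Apply the utility weighting at decision time, not inside the training loss."""
    return probability >= decision_threshold(cost_false_positive, cost_false_negative)


# Hypothetical costs: a missed positive is 9x as costly as a false alarm.
threshold = decision_threshold(cost_false_positive=1.0, cost_false_negative=9.0)

# The same predicted probability leads to different actions under different costs.
act_when_misses_are_costly = decide(0.2, 1.0, 9.0)
act_when_costs_are_equal = decide(0.2, 1.0, 1.0)
```

The design point is that the model's probabilities stay untouched, so the same trained model can serve different cost structures. This only works as intended if those probabilities are reasonably well calibrated to begin with.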


Sources (6 notes)

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Why do confident wrong answers hide in standard accuracy metrics?

Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.

Does training objective determine which direction models fail at abstention?

Reasoning-trained models under-abstain and overanswer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Why do improvements in accuracy come at the cost of calibration?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Nov 2025. A curated library identified:

• Binary reward signals (accuracy-only) provably degrade calibration; adding a proper scoring rule (Brier score) as a second reward term recovers both (~2024).
• Supervised fine-tuning raises final-answer accuracy while cutting reasoning informativeness by ~39%; models reach correct answers via pattern-matching, losing internal basis for uncertainty (~2024).
• Fluent confident errors concentrate in rare, high-harm cases and are invisible to aggregate accuracy metrics in safety-critical domains (~2024).
• Abstention direction (under- vs. over-) is a fingerprint of training objective; reasoning-trained models under-abstain, safety-trained over-abstain (~2025).
• Deterministic LLM settings create consistency without reliability; repeating an answer 100 times doesn't guarantee calibration (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (Feb 2024) – MobileLLM
• arXiv:2409.15360 (Sep 2024) – Reward-Robust RLHF
• arXiv:2506.09038 (Jun 2025) – AbstentionBench
• arXiv:2511.07699 (Nov 2025) – Misaligned by Design

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer models (GPT-4o, o1, Claude 4), methods (preference learning, DPO, reward-robust variants), tooling (confidence-scoring APIs, verifier orchestration), or evals (calibration benchmarks, utility-weighted metrics) have since relaxed or overturned it. Separate the durable question (likely: *can* training optimize both accuracy and calibration simultaneously?) from perishable limitations (e.g., "binary rewards force trade-offs" — has reward-robust RLHF solved this?). Cite what resolved each, plainly flag what still holds.
(2) SURFACE THE STRONGEST DISAGREEMENT. The library notes asymmetric loss functions weaken learning; does recent work on preference-based methods or utility-weighted objectives contradict or extend this?
(3) PROPOSE TWO RESEARCH QUESTIONS that assume the regime may have moved: e.g., "If modern verifiers can cheaply score uncertainty, does the accuracy–calibration trade-off disappear in practice?" or "Do emergent reasoning traces in 2025 models (o1-style) recover calibration without explicit reward retuning?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
