What makes the Brier score mathematically better than log-likelihood here?
This reads as a question about why one proper scoring rule, the Brier score (the squared error between predicted probability and outcome), would be preferred over another, log-likelihood (cross-entropy).
Up front: none of the retrieved notes name the Brier score or stage a head-to-head against log-likelihood, so I can't point you to a paper that settles "this specific math beats that specific math here," and what follows is a lateral synthesis from adjacent territory. What the corpus does have is the deeper principle the question is circling: the choice of scoring rule is never neutral. It quietly defines what counts as a good prediction, and a rule misaligned with your real objective will train the wrong thing even when the numbers look fine.
The sharpest illustration is the likelihood-swap work in recommendation. Switching a VAE from a Gaussian or logistic likelihood to a multinomial one produced state-of-the-art ranking, not because multinomial is "more correct" in some absolute sense, but because it forces items to compete for a fixed probability budget, which is exactly what top-N ranking rewards (see "Why does multinomial likelihood work better for ranking recommendations?" and "Why does multinomial likelihood work better for click prediction?"). Gaussian and logistic likelihoods let many items be confidently high at once, decoupling the loss from the goal. That's the same shape as a Brier-vs-log-likelihood argument: the two scoring rules disagree most about how to spend probability mass and how brutally to punish confident mistakes. Log-likelihood is unbounded: its penalty grows without limit as the probability assigned to the true outcome approaches zero, so a fully confident wrong prediction costs infinitely much. The Brier score is bounded (at most 1 for a single binary prediction) and penalizes the same error far more gently. Which property you want is a function of your objective, not a universal truth.
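To make the bounded-versus-unbounded contrast concrete, here is a minimal sketch that is not drawn from any of the notes. It assumes a single binary event that did occur and compares the two penalties as the forecast probability assigned to that event shrinks. The function names are hypothetical, chosen for illustration.

```python
import math

def brier_penalty(p_assigned: float) -> float:
    """Squared error for a binary event that occurred, given forecast p_assigned."""
    return (1.0 - p_assigned) ** 2

def log_penalty(p_assigned: float) -> float:
    """Negative log-likelihood (natural log) of the event that occurred."""
    return -math.log(p_assigned)

# p is the probability the forecaster gave to what actually happened.
# Smaller p means a more confident mistake.
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p={p:<6} brier={brier_penalty(p):.4f}  log={log_penalty(p):.4f}")
```

As p shrinks, the Brier penalty approaches but never exceeds 1, while the log penalty keeps growing. Under log-likelihood a handful of confident misses can dominate an average; under Brier they cannot.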
The second thread the corpus offers is calibration. A recurring finding is that optimizing for raw accuracy or reward can quietly wreck a model's sense of its own confidence, and that the fix is to make confidence itself part of the signal: RLSF uses answer-span confidence to rank reasoning traces and, in doing so, reverses the calibration damage that standard RLHF inflicts (see "Can model confidence work as a reward signal for reasoning?"). This matters for your question because Brier score and log-likelihood decompose differently: Brier separates cleanly into a calibration (reliability) term and a refinement term (uncertainty minus resolution), which is part of why people reach for it when they care about trustworthy probabilities and not just sharp ones. Related work shows that calibrated token-probability uncertainty can outperform far more expensive machinery for deciding when a model should act on its own confidence (see "Can simple uncertainty estimates beat complex adaptive retrieval?").
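The decomposition can be sketched in a few lines. This is an illustration of the standard Murphy decomposition (Brier = reliability - resolution + uncertainty), not something taken from the notes; the data and function names are made up for the example. It groups by exact forecast value, which is the case where the identity holds exactly; with binned continuous forecasts it holds only approximately.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    # Reliability (calibration): gap between each forecast and its observed frequency.
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    # Resolution: how far each group's observed frequency sits from the base rate.
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    # Uncertainty: variance of the outcome itself, independent of the forecaster.
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Hypothetical toy data: two forecast levels, five events each.
forecasts = [0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(brier_score(forecasts, outcomes), rel - res + unc)
```

In this toy data the 0.8 forecasts are perfectly calibrated (four of five events occur) while the 0.2 forecasts are not (two of five occur), so the whole reliability term comes from the 0.2 group. The refinement term mentioned above corresponds to uncertainty minus resolution.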
There's also a caution worth importing. A low loss under any scoring rule can be an artifact rather than a signal: deterministic decoding makes outputs look stable while they remain a single draw from a distribution (see "Does setting temperature to zero actually make LLM outputs reliable?"), and impressive benchmark numbers can be memorization rather than the capability the metric claims to measure (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). The lesson that generalizes to Brier vs log-likelihood: a scoring rule is only "better" relative to what you're trying to surface, and either rule can be gamed if you stop asking whether the metric still tracks the thing you care about.
So the honest answer is that the corpus reframes your question rather than answering it: "mathematically better" almost always resolves to "better aligned with the objective and the calibration behavior you need." If you want the actual proper-scoring-rule decomposition and the bounded-vs-unbounded penalty math, that lives outside this collection. The collection's repeated verdict, though, is that the rule that allocates probability mass the way your goal does, and keeps confidence honest, is the one that wins.
Sources (6 notes)
Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.
Multinomial likelihood better models click data because it forces items to compete for a fixed probability budget, implicitly optimizing for top-N ranking. Gaussian and logistic likelihoods allow high probability across many items simultaneously, misaligning training with ranking objectives.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.