Do verbal uncertainty estimates calibrate better than confidence scores for personalization?
This explores whether verbal expressions of uncertainty ('I'm not sure', 'probably') track real accuracy better than numeric confidence scores, and whether that distinction matters for systems tuned to an individual user. The corpus doesn't run that exact verbal-vs-numeric comparison, but it has a lot to say about what calibration actually buys you.
This reads the question as: does *how* a model expresses uncertainty — in words versus a number — change how well that uncertainty matches reality, especially when the system is personalizing to one person? The honest answer up front: the corpus doesn't contain a head-to-head study isolating verbal hedges against numeric confidence scores. What it does have reframes the question in a more useful direction — the format of the signal matters far less than whether it's calibrated at all, and far less than what users do with it.
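To make "matches reality" concrete, here is a minimal sketch of how both formats could be scored on the same scale using expected calibration error (ECE). Nothing in the corpus prescribes this procedure. The `HEDGE_TO_PROB` mapping is a hypothetical illustration with made-up values, and choosing that mapping is itself the hard part of any verbal-vs-numeric comparison.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical mapping from hedge phrases to probabilities (illustrative values only).
HEDGE_TO_PROB: Dict[str, float] = {
    "almost certainly": 0.95,
    "probably": 0.75,
    "i'm not sure": 0.5,
}


def verbal_to_confidence(phrases: Sequence[str]) -> List[float]:
    """Convert hedge phrases to numeric confidences via the assumed mapping."""
    return [HEDGE_TO_PROB[p.lower()] for p in phrases]


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Size-weighted mean gap between average confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Usage: score a verbal signal and a numeric signal against the same outcomes.
outcomes = [True, True, True, False]
verbal_ece = expected_calibration_error(
    verbal_to_confidence(["probably"] * 4), outcomes
)
numeric_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], outcomes)
```

A lower ECE means stated confidence sits closer to observed accuracy. The comparison only makes sense if the phrase-to-probability mapping reflects how readers actually interpret the words, which ties back to the perception problem discussed later.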
The strongest thread is that calibrated uncertainty is genuinely valuable, in whatever form. Token-probability uncertainty alone beats expensive multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop tasks, at a fraction of the LM and retriever calls, because a model's own self-knowledge is a more reliable trigger for 'should I look this up?' than external heuristics (Can simple uncertainty estimates beat complex adaptive retrieval?). Small models trained to be calibrated and to abstain when unsure can match models ten times their size on conversation forecasting, which suggests calibration is a learnable skill that standard training leaves on the table (Can models learn to abstain when uncertain about predictions?). And confidence can even be turned back on the model as a training reward: ranking reasoning traces by answer-span confidence strengthens reasoning while restoring the calibration that RLHF tends to degrade (Can model confidence work as a reward signal for reasoning?).
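The "should I look this up?" trigger can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: the aggregation (geometric mean of token probabilities), the function names, and the `threshold` value are all assumptions, and a real system would tune the threshold on held-out data.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, computed as exp(mean log-probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.6) -> bool:
    """Trigger retrieval only when the draft answer's confidence is below threshold.

    The threshold of 0.6 is an assumed placeholder, not a recommended value.
    """
    return mean_token_confidence(token_logprobs) < threshold


# Usage: log-probabilities would come from the model's draft answer tokens.
confident_draft = [math.log(0.9), math.log(0.95), math.log(0.9)]
shaky_draft = [math.log(0.3), math.log(0.4), math.log(0.2)]

needs_lookup_confident = should_retrieve(confident_draft)  # stays with the draft
needs_lookup_shaky = should_retrieve(shaky_draft)  # falls back to retrieval
```

The appeal of this design is cost: the gate reuses probabilities the model already produced, so it adds no extra LM or retriever calls unless the draft looks uncertain.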
Here's the twist that should reframe your question entirely: it may not matter which format calibrates better, because users don't respond to calibration; they respond to confidence itself. Across every language tested, people over-rely on confident outputs even when those outputs are wrong, tracking the confidence signal rather than the accuracy behind it (Do users worldwide trust confident AI outputs even when wrong?). The same heuristic-following shows up with citations: in 24,000 Search Arena interactions, irrelevant citations boosted user preference nearly as much as relevant ones (Do users trust citations more when there are simply more of them?). So a well-calibrated verbal hedge can still be ignored if the surrounding answer *sounds* confident, and a number can mislead if users read it as authority. The format question is really a human-perception question.
On the personalization side, the corpus treats uncertainty as something to *reduce*, not just report. PReF learns base reward functions from preference data, then personalizes by asking the few questions that most shrink uncertainty about a user's preference coefficients: uncertainty as a steering tool for active learning, with no weight changes needed (Can user preferences be learned from just ten questions?). But personalizing aggressively has a dark side: per-user reward models drop the averaging that keeps aggregate models honest, and can start learning sycophancy and reinforcing echo chambers (Does personalizing reward models amplify user echo chambers?). Pair that with the finding that personalization raises trust and anthropomorphism over time (Does chatbot personalization build trust or expose privacy risks?), and a calibrated, honestly hedged uncertainty signal becomes a *safety* feature: a brake on the trust a personalized system accrues, not just an accuracy readout.
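The idea of asking the questions that most shrink coefficient uncertainty can be sketched as follows. This is a simplified stand-in, not PReF's actual procedure: it assumes a linear-Gaussian model over the user's coefficients, where each candidate question is a vector of base-reward differences between two responses, and it picks the question with the largest predictive variance. All function names and the `noise_var` parameter are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: Matrix, v: Vector) -> List[float]:
    return [dot(row, v) for row in m]


def predictive_variance(cov: Matrix, x: Vector) -> float:
    """x^T * cov * x: how uncertain we are about the user's answer to question x."""
    return dot(x, mat_vec(cov, x))


def pick_question(cov: Matrix, candidates: Sequence[Vector]) -> int:
    """Index of the candidate whose answer is currently most uncertain."""
    return max(
        range(len(candidates)),
        key=lambda i: predictive_variance(cov, candidates[i]),
    )


def update_covariance(cov: Matrix, x: Vector, noise_var: float = 1.0) -> Matrix:
    """Rank-one posterior update after observing an answer to question x.

    Assumes Gaussian noise with variance noise_var and a symmetric covariance.
    """
    sx = mat_vec(cov, x)
    denom = noise_var + dot(x, sx)
    n = len(x)
    return [[cov[i][j] - sx[i] * sx[j] / denom for j in range(n)] for i in range(n)]


# Usage: two coefficients, with the first far more uncertain than the second.
prior = [[4.0, 0.0], [0.0, 1.0]]
questions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

first = pick_question(prior, questions)
posterior = update_covariance(prior, questions[first])
second = pick_question(posterior, questions)
```

In this toy setup the first question targets the most uncertain coefficient, and once that uncertainty shrinks the selector moves on to the other one. Everything happens at inference time over a small covariance matrix, which is consistent with the "no weight changes" framing above.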
The thing worth walking away with: the corpus suggests you're optimizing the wrong variable. Whether uncertainty comes out as words or numbers is downstream of two harder problems — making it calibrated in the first place (an undertrained skill), and getting users to actually act on it rather than on the surface confidence of the delivery. For personalization specifically, the live question isn't verbal-vs-numeric — it's whether your uncertainty signal is strong enough to counteract the inflated trust that personalization itself manufactures.
Sources (8 notes)
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Cross-linguistic research shows users in every language trust confident AI outputs even when inaccurate. While confidence expression varies by language, users everywhere track confidence signals rather than accuracy, making overconfident errors systematically followed.
Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics—each interaction raises the baseline, making failures more disappointing.