INQUIRING LINE

How does expressing uncertainty help models avoid the answer-or-abstain dilemma?

This explores why the usual choice (commit to an answer or refuse entirely) is a false binary, and how a model that can voice graded doubt sidesteps it. The starting insight is that models hallucinate not only because they lack knowledge but because they lack awareness of where their knowledge runs out (Can models express uncertainty instead of just answering?). Answer-or-abstain treats every question as a hard yes/no on whether to speak. But a model that can say "probably X, though I'm unsure about the date" gives the user something useful without either bluffing or going silent. The catch is that the expressed doubt has to be *faithful*, calibrated to the model's actual internal uncertainty, rather than a reflexive hedge.

That raises the obvious question: do models even have a reliable internal sense of their own confidence? The evidence suggests they do, but that this sense is undertrained. Small models given uncertainty-aware objectives and an abstention option can match pre-trained models ten times their size on conversation forecasting, which suggests the calibration signal is there and simply isn't cultivated by standard training (Can models learn to abstain when uncertain about predictions?). Confidence also turns out to be a measurable property: when a model is highly confident, its answers resist rewordings of the prompt, while low confidence shows up as large output swings (Does model confidence predict robustness to prompt changes?). So uncertainty isn't a vague mood; it's a quantity you can read off the model and align expression to.
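
One way to make "read confidence off the model" concrete is to ask the same question several ways and measure how often the answers agree. The sketch below is a minimal illustration under stated assumptions, not ProSA's method (ProSA defines its own sensitivity metric): answers are assumed to be short strings comparable after light normalization, and the function names and the 0.8 threshold are hypothetical. Low agreement triggers a graded hedge rather than a refusal.

```python
from collections import Counter


def paraphrase_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the share of reworded prompts that produced it.

    `answers` holds one model answer per paraphrase of the same question;
    gathering them from a model is left to the caller (hypothetical setup).
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so case and stray whitespace don't count as "flips".
    normalized = [a.strip().lower() for a in answers]
    # Counter.most_common breaks ties by first occurrence.
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


def answer_or_hedge(answers: list[str], threshold: float = 0.8) -> str:
    """Commit when answers survive rewording; otherwise hedge instead of refusing."""
    majority, agreement = paraphrase_consistency(answers)
    if agreement >= threshold:
        return majority
    return f"probably {majority} (only {agreement:.0%} of rewordings agree)"


stable = ["Paris", "paris", "Paris ", "Paris", "Paris"]
unstable = ["1912", "1915", "1912", "1920", "1915"]
print(answer_or_hedge(stable))    # paris
print(answer_or_hedge(unstable))  # probably 1912 (only 40% of rewordings agree)
```

Agreement across paraphrases is only a proxy; a faithful system would also check it against the model's internal signals such as token probabilities.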

Plain abstention falls short because binary training makes calibrated doubt hard to learn at all. If an abstention scores no better than a wrong answer, the model gains nothing by admitting uncertainty and learns to guess; nothing in the signal separates calibrated caution from a plain miss. TruthRL fixes this with a three-way reward (correct, hallucinated, abstained) in which abstention earns an intermediate score, making the doubt itself learnable; across four benchmarks it cut hallucinations by 28.9% relative to binary-reward RL (Can three-way rewards fix the accuracy versus abstention problem?). A complementary line uses the model's own answer-span confidence as a reward signal, which restores the calibration that ordinary RLHF degrades (Can model confidence work as a reward signal for reasoning?). Both point the same way: when training gives credit for honest uncertainty, the answer-or-abstain wall dissolves into a spectrum.
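
The incentive argument can be checked with per-question expected-reward arithmetic. This is a sketch, not TruthRL's training code; it only shows why the reward shape changes what a reward-maximizing policy prefers. Assumptions: the outcome labels are hypothetical strings, the binary baseline scores abstention like a wrong answer (0), and the "intermediate" abstention reward defaults to 0 between +1 and -1. Under the binary baseline guessing is never worse than abstaining; under the three-way reward it pays only when the chance of being correct exceeds a break-even point.

```python
def binary_reward(outcome: str) -> float:
    # Accuracy-only baseline (assumed 1/0): an abstention scores the same as a wrong answer.
    return 1.0 if outcome == "correct" else 0.0


def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    # Three-way reward: correct +1, hallucinated -1, abstained in between (0 assumed by default).
    rewards = {"correct": 1.0, "hallucinated": -1.0, "abstained": abstain_reward}
    return rewards[outcome]


def should_answer(p_correct: float, reward_fn) -> bool:
    """Answer only if the expected reward of guessing beats the reward for abstaining."""
    guess = p_correct * reward_fn("correct") + (1 - p_correct) * reward_fn("hallucinated")
    return guess > reward_fn("abstained")


def break_even(abstain_reward: float) -> float:
    # With +1/-1 rewards, guessing pays off when p - (1 - p) > a, i.e. p > (1 + a) / 2.
    return (1 + abstain_reward) / 2


for p in (0.3, 0.7):
    print(p, should_answer(p, binary_reward), should_answer(p, ternary_reward))
# 0.3 True False
# 0.7 True True
```

Raising the abstention reward raises the bar for answering: with `lambda o: ternary_reward(o, abstain_reward=0.2)` as the reward function, `should_answer` returns True only when the probability of being correct exceeds `break_even(0.2)`, i.e. 0.6.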

There's a deeper architectural angle worth knowing about. Models with deterministic latent reasoning collapse to a single prediction even when several answers are plausible. GRAM replaces deterministic latent updates with stochastic sampling, so the model can hold a distribution over solutions and represent ambiguity instead of forcing a point estimate (Can stochastic latent reasoning help models explore multiple solutions?). Uncertainty can also become a tool rather than a confession: instead of guessing or abstaining, a model can ask the clarifying question whose possible answers would most reduce its uncertainty (How can models select the most informative question to ask?). That reframes the whole dilemma: uncertainty isn't a dead end between two bad options, it's information that tells the model what to do next.
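
Choosing "the question whose answers would most reduce uncertainty" is an expected-information-gain calculation. Below is a greedy, one-step sketch of that idea; UoT itself also simulates future answer scenarios and propagates rewards across turns, which this omits. Assumptions: a finite set of candidate hypotheses with known probabilities, and each question has a deterministic yes/no answer under each hypothesis. All names and the diagnostic example are hypothetical.

```python
import math


def entropy(probs) -> float:
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: dict[str, float], yes_set: set[str]) -> float:
    """Expected drop in entropy over hypotheses from asking one yes/no question.

    `prior` maps each candidate hypothesis to its probability (summing to 1);
    `yes_set` holds the hypotheses under which the answer would be "yes".
    """
    before = entropy(prior.values())
    after = 0.0
    for in_branch in (True, False):
        members = [h for h in prior if (h in yes_set) == in_branch]
        branch_prob = sum(prior[h] for h in members)
        if branch_prob > 0:
            # Renormalize the prior within the branch the answer would select.
            after += branch_prob * entropy(prior[h] / branch_prob for h in members)
    return before - after


def most_informative_question(prior: dict[str, float], questions: dict[str, set[str]]) -> str:
    """Greedy one-step choice: the question with the largest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))


prior = {"flu": 0.25, "cold": 0.25, "allergy": 0.25, "covid": 0.25}
questions = {
    "Do you have a fever?": {"flu", "covid"},      # splits the hypotheses 50/50
    "Does it flare up seasonally?": {"allergy"},   # splits them 25/75
}
print(most_informative_question(prior, questions))  # Do you have a fever?
```

The even split wins (1 bit of expected gain versus about 0.81 bits), which matches the intuition that the most useful clarifying question is the one whose answer is hardest to predict in advance.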

The failure case shows why this matters. Models that *can't* hold a confidence boundary get talked out of correct answers: over persuasive multi-turn conversation with no new evidence, they abandon true beliefs, as face-saving behavior learned in RLHF training overrides factual knowledge (Can models abandon correct beliefs under conversational pressure?). Faithful uncertainty is the metacognitive defense against exactly that: a model that knows the difference between "I'm confident" and "I'm hedging to please you" is far harder to push off the truth.


Sources (8 notes)

Can models express uncertainty instead of just answering?

Models hallucinate because they lack awareness of their own knowledge boundaries, not just knowledge itself. Expressing uncertainty calibrated to intrinsic uncertainty—faithful uncertainty—offers a metacognitive solution beyond the answer-or-abstain tradeoff.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

How can models select the most informative question to ask?

UoT combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty—providing a principled mechanism for specific, high-value clarification rather than generic prompts.

Can models abandon correct beliefs under conversational pressure?

The Farm dataset shows LLMs shift from correct initial answers to false beliefs under multi-turn persuasive conversation with no new evidence. Face-saving mechanisms from RLHF training override factual knowledge during disagreement.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether expressed uncertainty still helps models escape the answer-or-abstain dilemma, or whether newer training, inference, or eval methods have shifted the constraint.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–26. A library covering this path documents:
• Models hallucinate partly from lack of *awareness* of uncertainty boundaries, not just missing knowledge; faithful calibration (not reflexive hedging) can sidestep the binary (2024–25).
• Three-way rewards (correct / hallucinated / abstained) with intermediate credit for abstention reduce hallucinations ~30%; binary "refuse" signals don't separate caution from cowardice (2025–26, TruthRL).
• Model confidence is measurable via answer-span probability and prompt-robustness; low-confidence outputs flip wildly under rewording, high-confidence resist it; this signal degrades under standard RLHF but can be restored as intrinsic reward (2024–26).
• Stochastic latent reasoning (GRAM, 2026) lets models hold distributions over solutions instead of collapsing to point estimates, representing ambiguity natively.
• Models without calibrated uncertainty boundaries abandon correct beliefs under persuasive multi-turn conversation, even without new evidence; faithful uncertainty is framed as the metacognitive defense (2023–26).

Anchor papers (verify; mind their dates):
• arXiv:2509.25760 (2025-09, TruthRL: ternary reward structure).
• arXiv:2605.19376 (2026-05, Generative Recursive Reasoning: stochastic latent reasoning).
• arXiv:2312.09085 (2023-12, The Earth is Flat: persuasion & belief shift).
• arXiv:2402.03284 (2024-02, Deal or no deal: uncertainty in conversation).

Your task:
(1) RE-TEST EACH CONSTRAINT. For the ternary reward, three-way calibration, and answer-span confidence as reward: do newer models (o1, Claude 3.5, Llama-style post-training) show that these signals still hold, or have end-to-end scaling and better base priors made them moot? Where does abstention still fail, and where has it matured? Separate the durable question (how to train faithful uncertainty) from perishable claims (e.g., "small models need special objectives").
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months—especially any showing that uncertainty expression is a surface phenomenon, or that newer inference methods (speculative decoding, tree search, multi-shot ensembling) solve the dilemma without explicit calibration.
(3) Propose 2 research questions that assume the regime may have moved: (a) Does uncertainty become *action-aware* in agentic deployments, or stay confined to single-turn prediction? (b) Can uncertainty expression be *verified* by users (e.g., via checkable ground truth), or does faithful calibration remain an ML-internal property?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
