Can three-way rewards fix the accuracy versus abstention problem?

Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?

Synthesis note · 2026-02-23 · sourced from Alignment

Standard RL for language models uses binary reward: correct or incorrect. This creates a forced trade-off. Optimizing for accuracy pushes the model to always answer, amplifying hallucinations. Optimizing for caution encourages abstention, sacrificing correct answers. Both extremes compromise truthfulness.

TruthRL introduces a ternary reward that treats correct answers, hallucinations, and abstentions as three distinct outcomes with different reward values. The key insight is that abstention should receive an intermediate reward: lower than a correct answer, but higher than a hallucination. This makes "I don't know" a learnable response that the model can select when it is genuinely uncertain.
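To see why an intermediate abstention reward changes the optimal behavior, here is a minimal sketch that compares the expected reward of answering against the reward of abstaining. The numeric values (+1 for correct, 0 for abstention, -1 for hallucination) are assumptions chosen for illustration, since this note does not state the values TruthRL uses. The binary scheme likewise assumes an abstention scores the same as a wrong answer. The names `Outcome`, `TERNARY`, `BINARY`, and `should_answer` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATION = "hallucination"


# Assumed reward values for illustration only; the note does not give numbers.
TERNARY = {Outcome.CORRECT: 1.0, Outcome.ABSTAIN: 0.0, Outcome.HALLUCINATION: -1.0}

# Assumed binary scheme: anything that is not correct earns zero.
BINARY = {Outcome.CORRECT: 1.0, Outcome.ABSTAIN: 0.0, Outcome.HALLUCINATION: 0.0}


def expected_answer_reward(p_correct: float, rewards: dict) -> float:
    """Expected reward of answering when the answer is correct with probability p_correct."""
    return (p_correct * rewards[Outcome.CORRECT]
            + (1.0 - p_correct) * rewards[Outcome.HALLUCINATION])


def should_answer(p_correct: float, rewards: dict) -> bool:
    """True if answering has strictly higher expected reward than abstaining."""
    return expected_answer_reward(p_correct, rewards) > rewards[Outcome.ABSTAIN]
```

Under the assumed binary scheme, answering is preferred for any nonzero chance of being correct, so guessing is never penalized. Under the assumed ternary scheme, answering is preferred only when the probability of being correct exceeds 0.5. More generally the break-even probability is (r_abstain - r_hallucination) / (r_correct - r_hallucination), so the chosen reward values set how confident the model must be before answering.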

The approach includes knowledge boundary probing: for each training question, 256 responses are sampled. If none is correct, the question is marked as out-of-knowledge (OOK) and relabeled with "I don't know" as the ground truth. This gives the model explicit examples of when abstention is appropriate, based on its own capability boundaries.
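The probing step described above can be sketched as follows. This is an illustrative outline, not TruthRL's implementation: `sample_fn` (draws one model response for a question) and `is_correct` (grades a response against the reference answer) are hypothetical callables that the caller must supply, and the function name and return shape are assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def probe_knowledge_boundary(
    dataset: Iterable[Tuple[str, str]],
    sample_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    n_samples: int = 256,
    abstain_label: str = "I don't know",
) -> List[Tuple[str, str, bool]]:
    """Relabel questions the model never answers correctly as out-of-knowledge.

    Returns (question, target, is_ook) triples, where target is the original
    reference answer, or abstain_label if none of the n_samples responses
    was correct.
    """
    relabeled = []
    for question, reference in dataset:
        # Generator plus any() stops sampling at the first correct response.
        responses = (sample_fn(question) for _ in range(n_samples))
        knows = any(is_correct(response, reference) for response in responses)
        target = reference if knows else abstain_label
        relabeled.append((question, target, not knows))
    return relabeled
```

Because `any` short-circuits, questions the model can answer usually cost far fewer than 256 samples, while out-of-knowledge questions always consume the full budget. A real grader would likely need something more tolerant than exact string matching.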

Across four knowledge-intensive benchmarks, TruthRL achieves a 28.9% reduction in hallucinations and a 21.1% improvement in truthfulness compared to vanilla RL. The gains are consistent across Qwen and Llama backbones, in both retrieval and non-retrieval setups.

This directly addresses the problem identified in "Does reasoning fine-tuning make models worse at declining to answer?": standard reasoning training degrades abstention because the binary reward does not value it, and the ternary reward restores the abstention signal. It also complements "Does binary reward training hurt model calibration?". Both papers address the inadequacy of binary rewards, but from different angles: calibration via scoring rules versus truthfulness via ternary outcomes.

Related concepts in this collection 4

Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
the problem TruthRL solves: standard training destroys abstention capacity
Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
complementary approach: proper scoring rules for calibration vs ternary for truthfulness
Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
a third approach to the same binary-reward inadequacy
Does training objective determine which direction models fail at abstention? Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
ternary reward directly addresses this bidirectional problem: by making abstention a learnable intermediate-reward option, it provides a mechanism to correct both under-abstention (reasoning-trained) and over-abstention (safety-trained) toward calibrated abstention
