Can three-way rewards fix the accuracy versus abstention problem?
Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?
Standard RL for language models uses a binary reward: correct or incorrect. This creates a forced trade-off. Optimizing for accuracy pushes the model to always answer, which amplifies hallucinations. Optimizing for caution encourages abstention, which sacrifices correct answers. Both extremes compromise truthfulness.
TruthRL introduces a ternary reward that treats correct answers, hallucinations, and abstentions as three distinct outcomes with different reward values. The key insight is that abstention should receive an intermediate reward: lower than a correct answer, but higher than a hallucination. This makes "I don't know" a learnable response that the model can select when genuinely uncertain.
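The ordering of the three outcomes can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the note only states that abstention sits between a correct answer and a hallucination, so the concrete values (+1, 0, -1) and the names `Outcome`, `TERNARY_REWARD`, `BINARY_REWARD`, `expected_answer_reward`, and `prefers_abstention` are assumptions made for this example.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATION = "hallucination"


# Assumed illustrative values: only the ordering
# correct > abstain > hallucination comes from the note.
TERNARY_REWARD = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAIN: 0.0,
    Outcome.HALLUCINATION: -1.0,
}

# Binary baseline: anything that is not a correct answer scores the same.
BINARY_REWARD = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAIN: 0.0,
    Outcome.HALLUCINATION: 0.0,
}


def expected_answer_reward(p_correct: float, rewards: dict) -> float:
    """Expected reward of answering when the answer is right with probability p_correct."""
    return (
        p_correct * rewards[Outcome.CORRECT]
        + (1.0 - p_correct) * rewards[Outcome.HALLUCINATION]
    )


def prefers_abstention(p_correct: float, rewards: dict) -> bool:
    """True when abstaining has strictly higher expected reward than answering."""
    return rewards[Outcome.ABSTAIN] > expected_answer_reward(p_correct, rewards)
```

Under these assumed values, answering has expected reward 2p - 1, so abstaining wins whenever the chance of being correct is below one half. Under the binary baseline, abstaining is never strictly better than guessing, which is the incentive to always answer described above.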
The approach includes knowledge boundary probing: for each training question, 256 responses are sampled from the model. If none of them is correct, the question is marked as out-of-knowledge (OOK) and relabeled with "I don't know" as the ground truth. This gives the model explicit examples of when abstention is appropriate, based on its own capability boundaries.
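The probing step can be sketched as follows. This is a hedged illustration under stated assumptions: `sample_fn` stands in for whatever draws one response from the model, `is_correct` for whatever grader is used, and `exact_match` is a toy grader. All three names, and the function `probe_knowledge_boundary` itself, are hypothetical and not taken from the paper. The loop stops at the first correct sample, which is equivalent to checking whether none of the 256 samples is correct.

```python
from typing import Callable, Tuple

ABSTAIN_LABEL = "I don't know"


def probe_knowledge_boundary(
    question: str,
    reference: str,
    sample_fn: Callable[[str], str],         # hypothetical: draws one model response
    is_correct: Callable[[str, str], bool],  # hypothetical: grades a response
    n_samples: int = 256,
) -> Tuple[str, bool]:
    """Return (training_label, is_out_of_knowledge) for one question."""
    for _ in range(n_samples):
        if is_correct(sample_fn(question), reference):
            # At least one sample is right: the question is within the
            # model's knowledge, so the original reference answer is kept.
            return reference, False
    # No sample was correct: mark as OOK and relabel with abstention.
    return ABSTAIN_LABEL, True


def exact_match(response: str, reference: str) -> bool:
    """Toy grader (assumption): case- and whitespace-insensitive match."""
    return response.strip().lower() == reference.strip().lower()
```

A question whose sampled responses never match the reference comes back as `("I don't know", True)`; any other question keeps its original label.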
Across four knowledge-intensive benchmarks, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1% compared to vanilla RL. The gains are consistent across Qwen and Llama backbones, in both retrieval and non-retrieval setups.
This directly addresses the problem identified in "Does reasoning fine-tuning make models worse at declining to answer?" Standard reasoning training degrades abstention because the binary reward doesn't value it. Ternary reward restores the abstention signal. Similarly, it complements "Does binary reward training hurt model calibration?": both address the inadequacy of binary rewards, but from different angles, namely calibration via scoring rules versus truthfulness via ternary outcomes.
Inquiring lines that use this note as a source 49
This note is a source for these synthesized inquiries.
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- Why does binary reward forcing degrade model calibration?
- Why does RLHF degrade honesty while improving surface-level helpfulness?
- How does RLHF reward structure incentivize agreement over accuracy?
- Why do reward models trained for accuracy ignore important context about the input?
- How do reward model ensembles improve robustness to miscalibration?
- How does uncritical acceptance of information relate to silent agreement failures?
- Can models distinguish between truthfulness and honesty mechanistically?
- How do models decide between refusing or hallucinating?
- Do outcome-only reward signals miss step-level errors that compound later?
- What happens when confident wrong answers become more rewarded than uncertain correct ones?
- Can we measure indifference to truth separately from hallucination rates?
- Can intrinsic reward signals extend beyond mathematics to medicine and law?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- How does negative reinforcement redistribute probability without guiding toward correct answers?
- Does reducing social judgment help both honesty and dishonesty equally?
- Can model confidence signals replace explicit external reward functions?
- How can reward structures teach models when to speak and when to stay silent?
- What information-theoretic framework explains why process rewards beat outcome only?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- Why do spurious rewards work nearly as well as correct ones?
- How do agents decide when to abstain from contributing?
- What happens when error accumulation and preference signal collapse occur together?
- What makes abstention a learnable behavior instead of a default penalty?
- How should safety training and reasoning training balance abstention differently?
- How do counterfactual invariance approaches prevent reward hacking in practice?
- How does RLHF training reward models for guessing over asking clarifying questions?
- Can reward design fix the conflict between reasoning accuracy and abstention calibration?
- When models lack representation depth, does refusal look identical to safety-driven over-abstention?
- Can behavior-level emotion rewards maintain factual reliability in emotional contexts?
- Why might larger models become less honest despite better truthfulness scores?
- Can reward factorization represent trade-offs between conflicting moral values?
- How do checklists prevent reward models from exploiting superficial response artifacts?
- Why do reward models fail to recognize genuinely different valid answers?
- How does 93% reward reliability compare to other RL noise sources?
- What makes exploration and reflection rewards verifiable in agentic environments?
- What happens when variance in reward signals comes from a noisy model?
- Does outcome-based reinforcement learning improve explanation faithfulness?
- How do relational reward signals compare to absolute preference encodings in RL?
- Are different reward signal sources substitutable in verifier-free RL?
- Why do majority-vote rewards amplify errors below an accuracy threshold?
- Can structured rewards still teach models when spurious rewards also work?
- What makes binary rewards more effective than richer reward signals?
- When does a task lack a meaningful multi-dimensional reward structure?
- What makes reward signal sources substitutable across verifier-free RL patterns?
- How do mechanistic interpretability tools help distinguish truthfulness from honesty?
- Can models be honest without being truthful about facts?
- How does expressing uncertainty help models avoid the answer-or-abstain dilemma?
- How can models select the optimal question to ask given multiple uncertainties?
Related concepts in this collection 4
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  Relation: the problem TruthRL solves. Standard training destroys abstention capacity.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Relation: a complementary approach. Proper scoring rules target calibration; ternary reward targets truthfulness.
- Can model confidence work as a reward signal for reasoning?
  Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
  Relation: a third approach to the same binary-reward inadequacy.
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  Relation: ternary reward directly addresses this bidirectional problem. By making abstention a learnable intermediate-reward option, it provides a mechanism to correct both under-abstention (reasoning-trained) and over-abstention (safety-trained) toward calibrated abstention.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- Learning to Reason for Factuality
- A Survey on Post-training of Large Language Models
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- Let’s Verify Step by Step
- The Hallucination Tax of Reinforcement Finetuning
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- Spurious Rewards: Rethinking Training Signals in RLVR
Original note title
ternary reward that distinguishes correct answers hallucinations and abstentions solves the accuracy-abstention trade-off in RL for truthfulness