SYNTHESIS NOTE

Can three-way rewards fix the accuracy versus abstention problem?

Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?

Synthesis note · 2026-02-23 · sourced from Alignment

Standard RL for language models uses binary reward: correct or incorrect. This creates a forced trade-off. Optimizing for accuracy pushes the model to always answer, amplifying hallucinations. Optimizing for caution encourages abstention, sacrificing correct answers. Both extremes compromise truthfulness.

TruthRL introduces a ternary reward that treats correct answers, hallucinations, and abstentions as three distinct outcomes with different reward values. The key insight is that abstention should receive an intermediate reward: lower than a correct answer, but higher than a hallucination. This makes "I don't know" a learnable response that the model can select when genuinely uncertain.
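To make the ordering concrete, here is a minimal sketch of how a ternary reward changes the best action under uncertainty. The specific values (+1, 0, -1) and the names `Outcome`, `expected_answer_reward`, and `should_abstain` are assumptions for illustration; the note only states that abstention sits between the other two outcomes.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATION = "hallucination"


# Assumed reward values, chosen only to satisfy
# correct > abstain > hallucination.
TERNARY_REWARD = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAIN: 0.0,
    Outcome.HALLUCINATION: -1.0,
}

# Binary baseline: anything that is not correct scores the same,
# so abstaining is never better than guessing.
BINARY_REWARD = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAIN: 0.0,
    Outcome.HALLUCINATION: 0.0,
}


def expected_answer_reward(p_correct: float, table: dict) -> float:
    """Expected reward of answering when the answer is right with p_correct."""
    return (p_correct * table[Outcome.CORRECT]
            + (1.0 - p_correct) * table[Outcome.HALLUCINATION])


def should_abstain(p_correct: float, table: dict) -> bool:
    """True when abstaining has strictly higher expected reward than answering."""
    return table[Outcome.ABSTAIN] > expected_answer_reward(p_correct, table)


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p,
              "binary abstain:", should_abstain(p, BINARY_REWARD),
              "ternary abstain:", should_abstain(p, TERNARY_REWARD))
```

With these assumed values, answering has expected reward 2p - 1, so abstaining wins whenever the chance of being correct is below one half. Under the binary table, abstaining never wins, which is the forced trade-off described above.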

The approach includes knowledge boundary probing: for each training question, 256 responses are sampled. If none is correct, the question is marked as out-of-knowledge (OOK) and relabeled with "I don't know" as the ground truth. This gives the model explicit examples of when abstention is appropriate, based on its own capability boundaries.
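The probing step can be sketched roughly as follows. This is an illustrative outline and not the authors' implementation: `sample_fn` (draws one response from the model) and `is_correct` (grades a response against the reference answer) are hypothetical callables supplied by the caller, and only the sample count of 256 and the "I don't know" relabeling come from the note.

```python
from typing import Callable, Iterable, List, Tuple

IDK = "I don't know"


def probe_label(
    question: str,
    gold: str,
    sample_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    n_samples: int = 256,
) -> str:
    """Return the training label for one question.

    Keeps the gold answer if any of n_samples responses is correct;
    otherwise treats the question as out-of-knowledge (OOK) and
    returns the abstention string.
    """
    for _ in range(n_samples):
        if is_correct(sample_fn(question), gold):
            return gold  # inside the model's knowledge boundary
    return IDK  # no sample was correct: mark as OOK


def relabel_out_of_knowledge(
    dataset: Iterable[Tuple[str, str]],
    sample_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    n_samples: int = 256,
) -> List[Tuple[str, str]]:
    """Apply probe_label to every (question, gold) pair."""
    return [
        (q, probe_label(q, gold, sample_fn, is_correct, n_samples))
        for q, gold in dataset
    ]
```

Stopping at the first correct sample is a shortcut in this sketch: it gives the same label as drawing all 256 responses, because the criterion is only whether any one of them is correct.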

Across four knowledge-intensive benchmarks, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1% compared with vanilla RL. The gains are consistent across Qwen and Llama backbones, in both retrieval and non-retrieval setups.

This directly addresses the problem identified in "Does reasoning fine-tuning make models worse at declining to answer?": standard reasoning training degrades abstention because the binary reward does not value it, and the ternary reward restores the abstention signal. It also complements "Does binary reward training hurt model calibration?". Both address the inadequacy of binary rewards, but from different angles: calibration via scoring rules versus truthfulness via ternary outcomes.

Original note title

ternary reward that distinguishes correct answers hallucinations and abstentions solves the accuracy-abstention trade-off in RL for truthfulness