Can reward design fix the conflict between reasoning accuracy and abstention calibration?

This explores whether you can design a reward signal that makes a model both more accurate at reasoning AND honest about when it should decline to answer — instead of having to trade one off against the other.

The corpus suggests that the tension between getting answers right and knowing when to keep quiet is largely an artifact of how rewards are shaped, not a law of nature. The root cause is surprisingly mechanical: a plain binary correctness reward (right = +1, wrong = 0) quietly teaches a model to bluff, because a confident wrong guess and a hedged wrong guess are scored identically, so there's no reason not to guess confidently. One note shows this isn't just empirical but provable: binary rewards degrade calibration, and adding a Brier-score term as a second objective mathematically guarantees that accuracy and calibration can be optimized together with no trade-off (see: Does binary reward training hurt model calibration?). That single result reframes the whole question: the 'conflict' is something reward design created and reward design can remove.
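To make the mechanism concrete, here is a minimal sketch in Python. It assumes the model emits a stated confidence alongside each answer, and that the Brier term is simply subtracted from the correctness reward. The function names and this exact combination are illustrative assumptions, not the formulation from the cited note.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Plain correctness reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumption: the two terms are combined by simple subtraction.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


if __name__ == "__main__":
    p = 0.6  # hypothetical true chance that the answer is correct
    for q in (0.2, 0.6, 1.0):
        print(q,
              expected_reward(binary_reward, p, q),
              expected_reward(brier_augmented_reward, p, q))
```

Under the binary reward the expected value is the same for every stated confidence, so nothing discourages bluffing. With the Brier term, the expected reward is p - p(1 - q)^2 - (1 - p)q^2, which peaks when the stated confidence q equals the true success probability p.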

From there the corpus offers several flavors of fix that converge on the same insight: abstention has to be made *learnable* by giving it its own place in the reward structure. The most direct is a ternary reward: correct gets +1, hallucination gets -1, and abstention gets an intermediate value, so 'I don't know' becomes a rational choice rather than a scored failure. Across four benchmarks this cut hallucinations by 28.9% and improved truthfulness by 21.1% relative to binary-reward RL, while preserving accuracy (see: Can three-way rewards fix the accuracy versus abstention problem?). A different angle reaches calibration without any new labels at all: use the model's own answer-span confidence as the reward signal, which both sharpens step-by-step reasoning and reverses the calibration damage that standard RLHF inflicts (see: Can model confidence work as a reward signal for reasoning?). Two routes, same lesson: the information needed to abstain well is already latent in training; binary rewards just throw it away.
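A short Python sketch shows why the intermediate value makes abstention a rational choice. The source only says the abstention reward is "intermediate"; the default of 0.0 below, along with the names `Outcome`, `ternary_reward`, and `should_abstain`, are assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    HALLUCINATION = "hallucination"
    ABSTENTION = "abstention"


def ternary_reward(outcome: Outcome, abstain_value: float = 0.0) -> float:
    """Three-way reward: +1 correct, -1 hallucination, intermediate for abstention.

    Assumption: abstain_value = 0.0 is a placeholder, not a published setting.
    """
    if not -1.0 < abstain_value < 1.0:
        raise ValueError("abstain_value must lie strictly between -1 and +1")
    if outcome is Outcome.CORRECT:
        return 1.0
    if outcome is Outcome.HALLUCINATION:
        return -1.0
    return abstain_value


def should_abstain(p_correct: float, abstain_value: float = 0.0) -> bool:
    """Abstain when answering has lower expected reward than abstaining."""
    expected_if_answering = p_correct * 1.0 + (1.0 - p_correct) * -1.0
    return expected_if_answering < abstain_value
```

With these values, answering has expected reward 2p - 1, so abstaining wins whenever p < (1 + abstain_value) / 2. Under a binary reward, by contrast, abstaining and being wrong both score 0, so guessing is never worse than declining.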

The interesting cross-current is *why* scalar rewards keep producing this problem in the first place. Several notes argue that a single number is simply too thin a channel. Agent feedback, for instance, decomposes into two orthogonal kinds of information, how well an action did (evaluative) and how it should change (directive), and a scalar reward can carry the first but discards the second (see: Can scalar rewards capture all the information in agent feedback?). In the same spirit, natural-language critiques break through performance plateaus that pure numerical rewards can't, precisely because the number never says *why* something failed (see: Can natural language feedback overcome numerical reward plateaus?). Calibration is exactly the kind of nuance that gets flattened when you compress judgment into one scalar, which is why richer reward structures, not just bigger ones, are what move the needle.

There's also a structural design principle worth borrowing here: *how* you wire a signal in matters as much as *what* it measures. One note shows that using rubrics as gates, to accept or reject whole rollout groups, prevents reward hacking far better than converting those same rubric scores into dense rewards, preserving the categorical 'is this acceptable' judgment while letting fine-grained rewards optimize only within valid answers (see: Can rubrics and dense rewards work together without hacking?). Map that onto abstention and a clean architecture emerges: gate on whether the model *should* answer at all, then optimize accuracy inside the answers it does commit to, structurally separating the two objectives the binary reward had collapsed together.
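The gate-then-optimize wiring can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the DRO implementation: `gated_rewards`, `toy_gate`, and `toy_dense_reward` are hypothetical names, and the toy rubric and score are placeholders.

```python
from typing import Callable, List, Optional, Sequence


def gated_rewards(
    rollout_group: Sequence[str],
    passes_gate: Callable[[Sequence[str]], bool],
    dense_reward: Callable[[str], float],
) -> Optional[List[float]]:
    """Return per-rollout rewards for an accepted group, or None if rejected.

    The rubric acts only as a categorical accept/reject decision on the
    whole group; its score is never folded into the dense reward.
    """
    if not passes_gate(rollout_group):
        return None  # group is dropped; no dense reward is computed
    return [dense_reward(rollout) for rollout in rollout_group]


def toy_gate(group: Sequence[str]) -> bool:
    # Hypothetical rubric: every rollout must state a final answer.
    return all("final answer:" in rollout.lower() for rollout in group)


def toy_dense_reward(rollout: str) -> float:
    # Hypothetical fine-grained score: shorter rollouts score higher.
    return 1.0 / (1 + len(rollout.split()))
```

Keeping the gate outside the reward means the fine-grained term cannot be used to buy back a rollout group the rubric rejected, which is the separation the paragraph above describes.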

One caveat keeps the optimism honest. A cluster of notes finds that reward-based RL mostly activates reasoning strategies already latent in the base model rather than installing new ones: it sharpens sampling efficiency within existing boundaries rather than expanding them (see: What does reward learning actually do to model reasoning?; Does RLVR actually expand what models can reason about?). Read against this question, that's reassuring rather than deflating: good calibration is fundamentally about *eliciting* knowledge the model already has, surfacing its own uncertainty, not about teaching it new facts. That is precisely the kind of thing reward design is good at. So the honest answer is yes, with a boundary: reward design can resolve the accuracy-versus-abstention conflict, because that conflict was a reward-shaping artifact all along, but it does so by making the model honest about what it knows, not by making it know more.

Sources 8 notes

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
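For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator below, given n samples per problem of which c are correct. Whether the cited analysis used exactly this estimator is an assumption; the sketch is only meant to show what "at high k" measures.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples drawn per problem, c: how many were correct, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct sample;
    # it is 0 whenever k exceeds the number of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A model whose correct answers are rare but present can score low at k = 1 and still approach 1.0 at large k, which is how a base model can overtake an RLVR-tuned one as k grows.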

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Can reward design fix the conflict between reasoning accuracy and abstention calibration?** A curated library (2024–2026) claims this tension is largely resolvable — but the findings are dated. Your job is to separate what still holds from what newer work may have already overturned.

**What a curated library found — and when (dated claims, not current truth):**
- Binary correctness rewards provably degrade calibration; adding a proper scoring rule (e.g., Brier score) as a second objective eliminates the trade-off mathematically (~2024–2025).
- Ternary rewards (correct +1, hallucination −1, abstention intermediate) cut hallucinations by ~29% while preserving accuracy (~2025).
- Model confidence as an intrinsic reward signal both sharpens reasoning and reverses calibration damage from standard RLHF (~2025).
- Natural-language feedback breaks performance plateaus that pure numerical rewards cannot; decomposing reward signals into evaluative + directive channels captures nuance scalars discard (~2025–2026).
- Rubric gates (categorical feasibility filters) prevent reward hacking better than dense scalar conversion; structural separation of feasibility and optimization outperforms collapsed objectives (~2025–2026).
- Reward-based RL primarily *elicits* latent reasoning rather than *expanding* capability boundaries beyond the base model (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2506.03106 (Critique-GRPO, ~2025): Natural language + numerical feedback in LLM reasoning.
- arXiv:2506.13351 (Direct Reasoning Optimization, ~2025): Token-level reasoning + rubric gates.
- arXiv:2507.14843 (The Invisible Leash, ~2025): Scope limits of RLVR.
- arXiv:2509.25760 (TruthRL, ~2025): Truthfulness via RL incentives.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every claim above — multi-objective reward removal of accuracy-vs-calibration trade-off, ternary reward efficacy, confidence-as-signal, rich feedback over scalars, rubric gating, elicitation vs. expansion — judge whether newer models (e.g., o1, latest Claude/GPT variants), scaling, multi-agent orchestration, or improved evaluation have since RELAXED, OVERTURNED, or CONFIRMED it. Distinguish the durable question (how to design rewards for both accuracy and honest abstention?) from perishable limitations (e.g., "binary rewards always fail"). Cite what changed it, or state plainly where the constraint still appears to hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Has any recent paper shown that reward design cannot resolve the conflict, or that the conflict vanishes without reward innovation?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If multi-modal reward signals now scale effortlessly, does the abstention-calibration conflict re-emerge at frontier reasoning tasks?" or "Can reward design remain the bottleneck if model scale, not signal design, now dominates calibration?"

**Guardrail:** Cite arXiv IDs; flag anything you cannot ground in a real paper.
