INQUIRING LINE

What makes abstention a learnable behavior instead of a default penalty?

This explores why teaching a model to say "I don't know" works better as a rewarded choice with its own score than as a flat punishment for staying silent.


The corpus has a clear answer: abstention becomes learnable when the reward signal stops treating "wrong" and "silent" as the same thing. The clearest case is a three-way reward where a correct answer earns +1, a confident hallucination earns -1, and an abstention sits in between (see "Can three-way rewards fix the accuracy versus abstention problem?"). That intermediate value is the whole trick: it tells the model that admitting uncertainty is better than guessing wrong, but not as good as knowing. Under a binary right/wrong scheme, abstention collapses into failure and the model learns to bluff. Across four benchmarks, the three-way version cut hallucinations by 28.9% relative to binary-reward RL while keeping accuracy intact.
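
A minimal sketch of the two reward schemes makes the difference concrete. The names `Outcome`, `binary_reward`, and `ternary_reward` are hypothetical, and the default abstention reward of 0.0 is an assumption: the source note only says the value is intermediate, so any value strictly between -1 and +1 fits the description.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    WRONG = "wrong"      # a confident hallucination
    ABSTAIN = "abstain"  # the model says "I don't know"


def binary_reward(outcome: Outcome) -> float:
    """Right/wrong scheme: abstaining is scored exactly like a wrong answer."""
    return 1.0 if outcome is Outcome.CORRECT else -1.0


def ternary_reward(outcome: Outcome, abstain_reward: float = 0.0) -> float:
    """Three-way scheme: abstention gets its own intermediate score.

    abstain_reward=0.0 is an illustrative assumption, not a value from the source.
    """
    if not -1.0 < abstain_reward < 1.0:
        raise ValueError("abstain_reward must lie strictly between -1 and +1")
    if outcome is Outcome.CORRECT:
        return 1.0
    if outcome is Outcome.WRONG:
        return -1.0
    return abstain_reward


if __name__ == "__main__":
    for outcome in Outcome:
        print(outcome.value, binary_reward(outcome), ternary_reward(outcome))
```

Under `binary_reward`, the abstain and wrong rows are identical, so nothing in the signal distinguishes honest silence from a bluff. Under `ternary_reward`, the three outcomes are strictly ordered.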

What makes this more than a tuning detail is a deeper finding about asymmetric signals: suppressing bad behavior and rewarding good behavior do different things to a model. Training on negative samples alone, that is, only pushing down wrong trajectories, often matches or beats full reinforcement learning, because it removes errors without collapsing the diversity of the model's outputs (see "Does negative reinforcement alone outperform full reinforcement learning?"). A pure default penalty for abstention is the opposite of this: it concentrates probability onto confident answers and squeezes out the very hedging you want. The same asymmetry shows up in agent learning, where treating successes as concrete demonstrations and failures as abstracted lessons outperforms processing both the same way (see "Should successful and failed episodes be processed differently?"). Abstention is just another case where the signal needs structure, not a single sign.
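
As a toy illustration only (not the setup from any of the cited notes), consider a softmax policy over three outputs: a correct answer, a wrong answer, and an abstention. The sketch below applies one gradient step that lowers the log-probability of a chosen output. The starting probabilities, the learning rate, and the helper names `softmax` and `penalize` are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize(probs, index, lr=0.5):
    """One gradient step that lowers log p[index] under a softmax policy."""
    logits = [math.log(p) for p in probs]
    # d log p[index] / d logit[j] = (1 if j == index else 0) - p[j]
    grads = [(1.0 if j == index else 0.0) - p for j, p in enumerate(probs)]
    return softmax([z - lr * g for z, g in zip(logits, grads)])


CORRECT, WRONG, ABSTAIN = 0, 1, 2
start = [0.5, 0.3, 0.2]  # hypothetical starting policy

after_wrong_penalty = penalize(start, WRONG)
after_abstain_penalty = penalize(start, ABSTAIN)

if __name__ == "__main__":
    print("penalize wrong answer:", [round(p, 3) for p in after_wrong_penalty])
    print("penalize abstention:  ", [round(p, 3) for p in after_abstain_penalty])
```

In this toy case, penalizing the wrong answer lowers its probability without pushing the abstention down, while penalizing the abstention shrinks it and hands part of the freed mass to the wrong answer.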

There's also a prerequisite the question doesn't name: a model can only learn to abstain well if it can tell when it's uncertain. Small models trained with uncertainty-aware objectives and an explicit abstention option matched models ten times their size on conversation forecasting, which suggests the calibration ability was already latent, just undertrained by standard objectives (see "Can models learn to abstain when uncertain about predictions?"). So abstention is learnable in two layers: the reward has to make silence-when-unsure worth more than a wrong guess, and the model has to have a usable internal sense of its own confidence to act on. A default penalty addresses neither: it just taxes silence regardless of whether the silence was wise.
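
The two layers can be tied together with a short expected-reward calculation. Assume a correct answer earns +1, a wrong answer earns -1, and the model's confidence p is the probability that its answer is right. Answering then has expected reward p(+1) + (1 - p)(-1) = 2p - 1, so abstaining pays off only when the abstention reward exceeds 2p - 1. The function names and the default abstention reward of 0.0 below are illustrative assumptions.

```python
def expected_answer_reward(confidence: float) -> float:
    """Expected reward of answering when right with probability `confidence`.

    Assumes +1 for a correct answer and -1 for a wrong one.
    """
    return confidence * 1.0 + (1.0 - confidence) * -1.0


def abstain_threshold(abstain_reward: float = 0.0) -> float:
    """Confidence below which abstaining has the higher expected reward."""
    return (1.0 + abstain_reward) / 2.0


def should_abstain(confidence: float, abstain_reward: float = 0.0) -> bool:
    """True when abstaining beats answering in expectation."""
    return abstain_reward > expected_answer_reward(confidence)


if __name__ == "__main__":
    for p in (0.3, 0.7):
        print(p, should_abstain(p), should_abstain(p, abstain_reward=-1.0))
```

With an abstention reward of 0.0 the threshold is a confidence of 0.5. If abstention is scored like a wrong answer (-1), the threshold drops to 0 and answering is never worse than abstaining, which is the bluffing incentive described above. The calculation only helps if the confidence estimate is itself usable, which is the second layer.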

The doorway worth walking through here is what reward learning actually moves. Several notes argue that this kind of RL doesn't install new capabilities so much as activate and reweight what's already in the base model (see "Does RLVR actually expand what models can reason about?" and "What does reward learning actually do to model reasoning?"). That reframes abstention entirely: the model probably already "knows" it's unsure in many cases. A well-shaped ternary reward isn't teaching a new skill; it's surfacing a calibration signal the model already carries and giving it permission to act on it. A default penalty, by contrast, teaches the model that the honest move costs points, which is exactly the wrong thing to reinforce.


Sources (6 notes)

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how reward structure enables abstention in LLMs. The question: does structured (ternary) reward genuinely make abstention learnable, or does recent work show the regime has shifted?

What a curated library found—and when (2023–2026, dated claims, not current truth):
• Ternary rewards (+1 correct, −1 hallucination, intermediate abstention) cut hallucinations ~29% vs. binary right/wrong schemes, where abstention collapses into failure (~2024).
• Negative reinforcement alone (suppressing errors without rewarding) often matches or exceeds full RL by preserving output diversity and avoiding overconfidence collapse (~2025).
• Small models with uncertainty-aware objectives and explicit abstention options matched 10× larger baselines on forecasting, suggesting calibration is latent in base models, not newly installed (~2024).
• RL reweights existing model capabilities rather than expanding reasoning boundaries beyond the base (~2025).
• Default penalties for silence teach models that honesty costs points—the inverse of what abstention training needs (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.03284 (Feb 2024) – uncertainty calibration in conversation forecasting
• arXiv:2506.01347 (Jun 2025) – negative reinforcement effectiveness in reasoning
• arXiv:2504.13837 (Apr 2025) – RL capability boundaries beyond base model
• arXiv:2507.14843 (Jul 2025) – constraints on RLVR escape

Your task:
(1) RE-TEST EACH CONSTRAINT. Does the ternary-reward advantage still hold under newer scaling, synthetic data, or chain-of-thought? Has test-time RL (arXiv:2504.16084) or verifiable meta-reasoning (arXiv:2507.22844) changed what "learnable" means? Separate the durable insight (reward structure matters for shaping behavior) from perishable claims (specific ternary numbers or baselines). Where does the constraint still hold?
(2) Surface work in the last ~6 months that CONTRADICTS the asymmetry claim—e.g., does positive RL on abstention now work as well? Flag papers arguing default penalties *can* work under different model scales or training regimes.
(3) Propose 2 questions assuming the regime shifted: (a) If RL mainly reweights latent calibration, does abstention require *different* architectures (e.g., explicit uncertainty tokens) to be truly learnable? (b) Can continual learning (arXiv:2310.10134) or meta-reasoning (arXiv:2507.22844) make abstention learnable *without* explicit reward structure?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
