What distinguishes verifiable rewards from preference-based rewards in unified training?
This explores the dividing line between two reward types — verifiable rewards (was the answer objectively correct?) versus preference-based rewards (did a judge or human prefer this output?) — and what happens when training tries to fold both into one signal.
The line runs between rewards you can *check* (the math is right, the code compiles) and rewards you can only *prefer* (this answer reads better, follows instructions more faithfully). What follows is what the corpus says about combining them. The short version: the two differ less in mechanism than in what they're allowed to certify, and the interesting work is happening at the seam where one is converted into the other.
Start with what verifiable rewards actually do. A recurring and slightly deflating finding is that reinforcement learning from verifiable rewards (RLVR) doesn't teach models new reasoning; it surfaces strategies already latent in pretraining. Pass@k analysis shows base models beating RLVR-trained models at high sampling budgets (see *Does RLVR actually expand what models can reason about?*), and the activation framing is echoed across multiple notes: a single training example can suffice, and even spurious rewards work nearly as well as correct ones for well-pretrained models (see *What does reward learning actually do to model reasoning?* and *How does RL training reshape reasoning and what gets lost?*). So 'verifiable' buys you a sharp, hack-resistant signal, but a narrow one. It catalyzes; it doesn't expand.
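To make the pass@k comparison concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are invented purely for illustration and are not taken from any of the cited notes; `mean_pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    n = 100
    # Made-up counts: the base model solves both problems, but rarely;
    # the RL-tuned model solves one often and the other never.
    base_counts = [2, 1]
    rl_counts = [60, 0]
    for k in (1, 10, 100):
        base = mean_pass_at_k(base_counts, n, k)
        rl = mean_pass_at_k(rl_counts, n, k)
        print(f"k={k:3d}  base={base:.3f}  rl={rl:.3f}")
```

With these toy numbers the RL-tuned model wins at k=1 (about 0.30 versus 0.015) and loses at k=100 (0.5 versus 1.0), which is the crossover shape the pass@k argument relies on: sharper sampling, narrower coverage.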
Preference-based rewards have the opposite profile: broad coverage of subjective quality, but soft and gameable. The corpus catalogs the damage on both sides. Binary correctness rewards, the simplest verifiable signal, quietly degrade calibration because they never punish confident wrong answers; the fix is to add a proper scoring rule such as the Brier score as a second reward term (see *Does binary reward training hurt model calibration?*). Holistic preference models overfit to superficial artifacts, which is why instruction-following gets decomposed into verifiable checklist sub-criteria (see *Can breaking down instructions into checklists improve AI reward signals?*). The most useful way to read these is as a spectrum, not a binary: 'unified training' is really the project of converting fuzzy preferences into checkable units without losing what made them broad.
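The calibration point can be sketched in a few lines. The version below assumes the second term is a plain unweighted Brier penalty on the model's stated confidence, so the reward is y - (q - y)^2 for correctness y and confidence q. The exact weighting used in the cited work is not given here, so treat the form and the function names as assumptions.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores confidence entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed unweighted form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


if __name__ == "__main__":
    # The binary reward cannot tell a confident miss from a hesitant one.
    print(binary_reward(False), binary_reward(False))
    # The Brier term penalizes the confident miss far more heavily.
    print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))
    # Reporting the true success probability maximizes expected reward.
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda q: expected_calibrated_reward(0.7, q))
    print("best confidence for p=0.7:", best)
```

Because the Brier score is a proper scoring rule, the expected reward peaks when stated confidence equals the true probability of being correct, so the penalty pushes toward honest confidence rather than toward hedging.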
The sharpest distinction the corpus draws is structural: how you *combine* the two matters more than which you use. One note shows that rubrics work best as **gates** that accept or reject whole rollout groups, while dense token-level rewards optimize *within* the survivors. Converting rubric scores directly into dense rewards invites hacking, but separating feasibility (preference-like) from optimization (verifiable-like) preserves both (see *Can rubrics and dense rewards work together without hacking?*). A related insight: a single scalar can't carry everything. Agent feedback decomposes into *evaluative* signal (how good the action was, which rewards capture) and *directive* signal (how it should change, which rewards discard), making the two complementary rather than substitutable (see *Can scalar rewards capture all the information in agent feedback?*). Ternary rewards make a similar move on the verifiable side, splitting the binary correct/wrong outcome into three (correct, hallucinated, abstained) so that abstention becomes learnable instead of collapsing into 'wrong' (see *Can three-way rewards fix the accuracy versus abstention problem?*).
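The gate-then-optimize separation might look roughly like the sketch below. The gating criterion (a minimum fraction of rollouts clearing a rubric threshold) and the mean-centered advantage are assumptions chosen for illustration; the cited note does not specify either, and `Rollout`, `gate_group`, and `gated_advantages` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Rollout:
    rubric_score: float  # rubric judgment in [0, 1] (hypothetical judge output)
    dense_reward: float  # e.g. summed token-level reward for this rollout


def gate_group(group: list[Rollout], min_score: float = 0.5,
               min_pass_frac: float = 0.5) -> bool:
    """Accept a rollout group if enough members clear the rubric (assumed rule)."""
    passed = sum(r.rubric_score >= min_score for r in group)
    return passed / len(group) >= min_pass_frac


def gated_advantages(groups: list[list[Rollout]],
                     **gate_kwargs) -> list[Optional[list[float]]]:
    """Return mean-centered dense advantages for accepted groups, None otherwise."""
    results: list[Optional[list[float]]] = []
    for group in groups:
        if not group or not gate_group(group, **gate_kwargs):
            results.append(None)  # rejected: contributes no gradient signal
            continue
        baseline = mean(r.dense_reward for r in group)
        # The rubric score never enters the advantage; it only gated the group.
        results.append([r.dense_reward - baseline for r in group])
    return results


if __name__ == "__main__":
    groups = [
        [Rollout(0.9, 2.0), Rollout(0.8, 4.0)],   # clears the rubric
        [Rollout(0.1, 10.0), Rollout(0.2, 0.0)],  # high dense reward, fails rubric
    ]
    print(gated_advantages(groups))
```

The design point is that a rollout cannot buy its way past the rubric with a high dense reward: the second group above is dropped outright, so there is nothing to gain from inflating the rubric score either.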
Here's the less obvious part: the verifiable/preference boundary may be dissolving from both ends. Reward models are growing reasoning traces and test-time compute, behaving less like fixed verifiers and more like deliberating judges (see *Can reward models benefit from reasoning before scoring?*). Meanwhile, a model's *own* confidence can manufacture synthetic preferences that improve reasoning and restore calibration with no human labels or external verifier at all (see *Can model confidence work as a reward signal for reasoning?*). The late-2025 literature converges on three substitution patterns in which the policy's internal computations replace the reward model, the critic, and the explicit reward signal respectively (see *Can language models replace reward models with internal signals?*); even belief-shift toward a solution becomes its own dense intrinsic reward (see *Can an agent's own beliefs guide credit assignment without critics?*). The unified picture isn't 'verifiable plus preference.' It's a continuum where the model increasingly generates both kinds of signal from inside itself.
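As an illustration of the belief-shift idea, the sketch below assigns each turn the log-ratio of consecutive probability estimates. It assumes the agent can report, after every turn, a probability that its current hypothesis is the correct solution; how that estimate is obtained, and the clipping constant `eps`, are assumptions rather than details from the cited note.

```python
import math


def belief_shift_rewards(beliefs: list[float], eps: float = 1e-8) -> list[float]:
    """Per-turn reward log(p_t / p_{t-1}) from a sequence of belief estimates.

    beliefs[0] is the estimate before the first turn; the result has one
    entry per turn. Values are clipped to [eps, 1] to keep the log finite.
    """
    clipped = [min(max(p, eps), 1.0) for p in beliefs]
    return [math.log(curr / prev) for prev, curr in zip(clipped, clipped[1:])]


if __name__ == "__main__":
    # Made-up trajectory: the third turn is a misstep that lowers belief.
    beliefs = [0.05, 0.1, 0.4, 0.2, 0.9]
    rewards = belief_shift_rewards(beliefs)
    for turn, r in enumerate(rewards, start=1):
        print(f"turn {turn}: {r:+.3f}")
    # The per-turn rewards telescope to the overall log-ratio.
    print("total:", sum(rewards), "vs", math.log(beliefs[-1] / beliefs[0]))
```

No critic or process reward model appears anywhere: the dense signal comes entirely from the policy's own estimates, and because the terms telescope, the per-turn credits always sum to the overall shift in belief across the episode.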
Sources (12 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.