Can RL with verifiable rewards improve dialogue quality better than preference optimization?
This explores whether reward signals you can measure objectively (emotion trajectories, checklists, consistency metrics) produce better conversations than training models to match human preference judgments — and the corpus suggests the answer hinges on what 'dialogue quality' even means.
The corpus's most striking move is to first show what preference optimization quietly breaks. Standard RLHF optimizes for looking helpful in a single turn: it rewards confident answers over clarifying questions and, in doing so, pushes the 'grounding acts' that make multi-turn dialogue reliable to 77.5% below human levels (see: Does preference optimization harm conversational understanding?). The same pathology shows up as passivity: models trained on next-turn rewards stop asking questions and stop discovering what the user actually wants (see: Why do language models respond passively instead of asking clarifying questions?). So the question isn't just 'is verifiable better?' but 'does preference optimization have a structural blind spot that something more measurable can fix?'
Verifiable rewards shine exactly where they can name the thing they're optimizing. RLVER uses a simulated user's emotion trajectory as the reward and shifts models from solution-dumping toward genuine empathy, notably *without* the usual trade-off where chasing one quality erodes conversational grounding (see: Can emotion rewards make language models genuinely empathic?). Other work decomposes fuzzy goals into checkable sub-criteria: turning 'follow this instruction well' into a checklist of verifiable items reduces the overfitting to superficial cues that plagues holistic preference models (see: Can breaking down instructions into checklists improve AI reward signals?). And consistency itself can be the verifier: training user simulators against prompt-to-line, line-to-line, and Q&A consistency metrics cuts persona drift by over 55% (see: Can training user simulators reduce persona drift in dialogue?).
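The checklist idea can be sketched in a few lines. This is a minimal illustration, not the cited methods' implementation: the function name `checklist_reward`, the example items, and the choice to score each item with a plain Python predicate are all assumptions made to keep the sketch runnable. How the cited work actually scores each item is not specified here.

```python
from typing import Callable, Dict

# Hypothetical type: a checklist item is a predicate over the response text.
Check = Callable[[str], bool]


def checklist_reward(response: str, checklist: Dict[str, Check]) -> float:
    """Return the fraction of checklist items the response satisfies (0.0 to 1.0)."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    passed = sum(1 for check in checklist.values() if check(response))
    return passed / len(checklist)


# Illustrative items (assumed) for the instruction:
# "Reply in under 50 words, ask a clarifying question, and do not use bullet points."
example_checklist: Dict[str, Check] = {
    "under_50_words": lambda r: len(r.split()) < 50,
    "asks_a_question": lambda r: "?" in r,
    "no_bullets": lambda r: not any(
        line.lstrip().startswith(("-", "*")) for line in r.splitlines()
    ),
}
```

A reply such as "Which database are you using?" satisfies all three items and scores 1.0, while "- Use Postgres" satisfies only the length item and scores one third. Because each item is scored separately, a response cannot earn reward from surface polish alone, which is the intuition behind the reduced overfitting described above.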
But the corpus plants a sharp caution flag against treating 'verifiable' as a magic word. RLVR, the verifiable-reward method most studied in reasoning, doesn't actually expand what a model can do: it narrows sampling toward solutions already in the base model's distribution, improving sampling efficiency rather than capability (see: Does RLVR actually expand what models can reason about?). And binary correctness rewards, the simplest verifiable signal, provably *degrade* calibration, teaching models to guess confidently because a confident wrong answer costs no more than a hesitant one (see: Does binary reward training hurt model calibration?). The verifier you choose is the behavior you get; a crude one breeds new failure modes.
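The calibration problem, and the Brier-score remedy recorded in the source notes, can be shown with two small reward functions. This is a sketch under assumptions: the function names are hypothetical, the model is assumed to report a confidence in [0, 1] alongside its answer, and the equal weighting of the two terms is an illustrative choice rather than a detail confirmed by the notes.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty, (confidence - outcome) squared.

    Assumes `confidence` is the model's self-reported probability of being correct.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2
```

Under `binary_reward`, a wrong answer earns 0.0 whether the model claimed 90% or 20% confidence, so nothing discourages confident guessing. Under `calibrated_reward`, the same wrong answer earns roughly -0.81 at 90% confidence and roughly -0.04 at 20%, so overconfidence is penalized directly.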
The most interesting lateral thread is that the preference-versus-verifiable dichotomy may be a false choice. Some methods manufacture preferences *from* a verifiable internal signal, ranking reasoning traces by the model's own answer-span confidence to build synthetic preferences that improve reasoning while reversing RLHF's calibration damage (see: Can model confidence work as a reward signal for reasoning?). Others reframe dialogue itself as a single optimizable trajectory rather than a stack of separate judgments: a unified policy beats deciding what to ask, what to recommend, and when to do each as separate problems, because isolated components can't share gradient signal (see: Can unified policy learning improve conversational recommender systems?).
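Building preferences from a confidence signal can be sketched as a simple pairing step. The names `Trace` and `build_preference_pairs`, the `min_gap` threshold, and the all-pairs construction are assumptions for illustration. The notes do not say how the cited method computes confidence or selects pairs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str
    # Assumed to be something like the mean token probability over the final answer span.
    answer_confidence: float


def build_preference_pairs(
    traces: List[Trace], min_gap: float = 0.1
) -> List[Tuple[Trace, Trace]]:
    """Return (chosen, rejected) pairs where chosen is more confident by at least min_gap."""
    ranked = sorted(traces, key=lambda t: t.answer_confidence, reverse=True)
    pairs = []
    # combinations() preserves order, so `better` always ranks above `worse`.
    for better, worse in combinations(ranked, 2):
        if better.answer_confidence - worse.answer_confidence >= min_gap:
            pairs.append((better, worse))
    return pairs
```

The resulting pairs have the same shape as human preference data, so they could in principle feed any pairwise preference-optimization objective. The `min_gap` filter is a design choice that drops near-ties, where the confidence ranking carries little signal.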
The takeaway you didn't know you wanted: 'dialogue quality' is too coarse to answer the question as posed. Verifiable rewards win decisively for the specific, nameable qualities preference optimization neglects — empathy, persona consistency, grounding, multi-turn collaboration — precisely because preference models collapse those into a single helpfulness signal. But for the holistic, hard-to-specify feel of a good conversation, an ill-chosen verifier just trades one blind spot for another. The frontier isn't picking a side; it's building verifiers good enough that 'verifiable' and 'preferred' stop pointing in different directions.
Sources (9 notes)
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.
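The pass@k comparison in the RLVR note above rests on a quantity that is easy to compute. The sketch below uses the standard unbiased pass@k estimator: given n samples of which c are correct, it returns the probability that at least one of k samples drawn without replacement is correct. Whether the cited analysis uses exactly this estimator is an assumption, and the numbers in the usage note are illustrative, not results from the corpus.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn, c: number of correct samples, k: evaluation budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the result is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

As a hypothetical illustration, a model that solves a problem in only 1 of 100 samples has pass@1 of about 0.01 but pass@100 of 1.0, while a model that never samples a correct solution scores 0.0 at every k. A tuned model can therefore win at low k by concentrating probability on solutions it already finds, yet lose at high k if it has stopped sampling rare solutions the base model still reaches.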