Why do reasoning models fail at predicting disagreement?
RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
Reinforcement learning with verifiable rewards (RLVR) optimizes for tasks with single correct answers: math solutions, code outputs, deterministic verifications. This optimization has a side effect: RLVR-trained models get significantly worse at predicting the distribution of human annotation disagreements, particularly when annotation variance is high. The models become better at the deterministic goal (predicting the majority annotation) but worse at the probabilistic goal (predicting the proportion of annotators who disagree).
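To make the two goals concrete, here is a minimal sketch that derives both targets from one item's raw annotations. The function name `annotation_targets`, the label strings, and the annotation counts are all hypothetical and chosen only for illustration; they do not come from any dataset discussed in this note.

```python
from collections import Counter

def annotation_targets(labels):
    """Return (majority_label, disagreement_rate) for one item's annotations.

    majority_label is the deterministic target; disagreement_rate (the share
    of annotators who did not choose the majority label) is the probabilistic one.
    """
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    disagreement_rate = (len(labels) - majority_count) / len(labels)
    return majority_label, disagreement_rate

# Hypothetical annotations: ten annotators per item.
clear_item = ["toxic"] * 10
contested_item = ["toxic"] * 6 + ["not_toxic"] * 4

clear_targets = annotation_targets(clear_item)
contested_targets = annotation_targets(contested_item)
print(clear_targets, contested_targets)
```

Both items share the same majority label, so a model trained only toward the deterministic target sees them as identical. Only the second value distinguishes the unanimous item from the contested one.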
The contrast with RLHF models is revealing. For RLHF-trained models, Chain-of-Thought reasoning significantly improves disagreement prediction. For RLVR models, forcing additional reasoning effort does not improve — and can worsen — disagreement prediction. The reasoning pathways that RLVR develops are optimized for convergence toward a single answer, not for representing the legitimate spread of human interpretations.
This connects to a broader pattern. As "Why do readers interpret the same sentence so differently?" explores, much annotation disagreement may reflect genuine interpretive multiplicity rather than error, and tasks that require capturing this multiplicity are structurally mismatched with RLVR's optimization signal. The verifiable reward framework assumes one right answer exists. Many real-world annotation tasks involve multiple valid perspectives, which is precisely the scenario where RLVR models fail.
As "Do standard NLP benchmarks hide LLM ambiguity failures?" argues, majority-label evaluation conceals this degradation. A model that perfectly predicts the majority vote may be useless at capturing, say, the 40% of annotators who disagree, and that disagreement often carries the most informative signal about task subjectivity and sample ambiguity.
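The concealment can be sketched with two hypothetical models scored on the same items. Everything here is an illustrative assumption: the metric names, the choice of mean absolute error as the distribution metric, and the numbers, which are invented and not measured results. Each value is the fraction of annotators choosing the positive label, and ties at exactly 0.5 are avoided in the toy data.

```python
def majority_accuracy(pred_probs, human_probs):
    """Share of items where the predicted majority label matches the human one."""
    hits = sum((p >= 0.5) == (h >= 0.5) for p, h in zip(pred_probs, human_probs))
    return hits / len(human_probs)

def mean_abs_error(pred_probs, human_probs):
    """Average gap between predicted and observed positive-label proportions."""
    gaps = [abs(p - h) for p, h in zip(pred_probs, human_probs)]
    return sum(gaps) / len(gaps)

# Hypothetical fraction of annotators who chose the positive label, per item.
human = [0.6, 0.9, 0.4, 0.1]

# Two hypothetical models: one collapses to certainty, one tracks the spread.
overconfident = [1.0, 1.0, 0.0, 0.0]
distribution_aware = [0.65, 0.85, 0.35, 0.15]

for name, preds in [("overconfident", overconfident),
                    ("distribution_aware", distribution_aware)]:
    print(name, majority_accuracy(preds, human), mean_abs_error(preds, human))
```

Under these assumptions both models reach perfect majority accuracy, so a majority-label benchmark cannot tell them apart. The distribution metric separates them: the overconfident model is off by 0.25 on average, the distribution-aware one by about 0.05.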
There is also a parallel with the optimization cost described in "Does preference optimization harm conversational understanding?": RLVR and RLHF exhibit the same narrowing dynamic through different mechanisms. RLHF optimizes for single-turn helpfulness, eroding conversational grounding acts; RLVR optimizes for deterministic correctness, eroding sensitivity to legitimate interpretive variance. Both sacrifice multiplicity for confidence.
Seen through "Does binary reward training hurt model calibration?", RLVR's disagreement degradation is a specific case of binary reward's calibration failure. A binary correct/incorrect signal cannot represent the distribution of human disagreement, because it structurally encodes the assumption that one answer exists. The calibration fix (adding a proper scoring rule) addresses confidence-accuracy alignment but not the deeper problem of variance suppression in inherently multi-answer tasks.
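A toy sketch of why a correctness-based reward, even with a calibration term, never sees the annotator split. The reward form used here (correctness minus a Brier-style penalty on stated confidence) is an assumed illustrative form, not a confirmed description of any specific training recipe, and the function names, confidence grid, and items are hypothetical.

```python
def binary_reward(answer, gold):
    """1.0 if the answer matches the single gold label, else 0.0."""
    return 1.0 if answer == gold else 0.0

def calibrated_reward(answer, confidence, gold):
    """Assumed form: correctness minus a Brier-style penalty on confidence."""
    correct = binary_reward(answer, gold)
    return correct - (confidence - correct) ** 2

def best_policy(gold):
    """Grid-search the (answer, confidence) pair with the highest reward."""
    grid = [i / 10 for i in range(11)]
    candidates = [(a, c) for a in (0, 1) for c in grid]
    return max(candidates, key=lambda ac: calibrated_reward(ac[0], ac[1], gold))

# Hypothetical items: fraction of annotators who chose the positive label.
items = {"unanimous": 1.0, "contested": 0.6}

results = {}
for name, positive_share in items.items():
    # The annotator split enters the reward only through the majority gold label.
    gold = 1 if positive_share >= 0.5 else 0
    results[name] = best_policy(gold)

print(results)
```

In this sketch the optimum is the same for both items: answer with the majority label at full confidence. The 60/40 split on the contested item never reaches the reward, so the scoring term aligns confidence with correctness while leaving the disagreement itself unrepresented.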
The practical implication for using LLMs as annotators: RLVR models may be worse than non-reasoning models on subjective annotation tasks. The convergent reasoning that helps with math can hurt when the input is genuinely ambiguous.
Inquiring lines that use this note as a source (10)
- Why does RLHF alignment reduce the diversity of viewpoints in AI output?
- Does user preference for confirmation override model capability for disagreement?
- Why does social accommodation in collaborative reasoning mask actual disagreement?
- What makes factual verification difficult in inter-model debate?
- Do chain-of-thought prompts help RLVR models predict annotation disagreement?
- Can proper scoring rules fix RLVR's degradation on disagreement prediction?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- Why does RLHF alone fail to fully prevent opinion copying?
- Why do high-disagreement tasks benefit from broad rater pools over deep annotation?
- Why do reasoning-optimized models show no resistance advantage on agreement tasks?
Related concepts in this collection (6)
- Why do readers interpret the same sentence so differently?
  How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.
  RLVR optimization suppresses exactly this multiplicity
- Do standard NLP benchmarks hide LLM ambiguity failures?
  When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
  majority-label evaluation hides the degradation
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  persona instability meets RLVR variance suppression
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  RLVR disagreement failure is a specific case of this general pattern
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  parallel narrowing: RLHF erodes grounding acts, RLVR erodes variance sensitivity; both sacrifice multiplicity for optimization target
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  binary reward is the structural mechanism: correct/incorrect cannot represent disagreement distributions
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can Large Language Models Capture Human Annotator Disagreements?
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Original note title
rlvr reasoning models degrade at predicting human annotation disagreements — optimization for deterministic answers suppresses sensitivity to legitimate variance