Do chain-of-thought prompts help RLVR models predict annotation disagreement?
This explores whether adding chain-of-thought reasoning to RLVR-trained models can recover their ability to predict how human annotators legitimately disagree — and the corpus suggests the answer is mostly no, because the two techniques pull against the same underlying capability.
The corpus points toward a discouraging answer, because the problem chain-of-thought (CoT) would need to fix is one that RLVR training itself created. The starting point is that RLVR optimization suppresses a model's sensitivity to legitimate disagreement: trained to chase a single deterministic 'correct' answer, these models degrade sharply at representing the spread of valid human interpretations, especially when that spread is wide (see "Why do reasoning models fail at predicting disagreement?"). So the question is really whether bolting reasoning steps onto the prompt can reopen a door that the training procedure welded shut.
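One way to make "predicting disagreement" concrete is to compare a model's predicted label distribution with the empirical distribution of annotator labels. The sketch below is illustrative only: the label set, the annotator votes, and both model outputs are hypothetical, and Jensen-Shannon divergence is one reasonable scoring choice, not necessarily the metric used in the cited work.

```python
import math
from collections import Counter

def label_distribution(labels, classes):
    """Turn a list of annotator labels into a probability vector over classes."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p is zero contribute nothing."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical item with ten annotators who legitimately disagree.
classes = ["entailment", "neutral", "contradiction"]
votes = ["entailment"] * 5 + ["neutral"] * 4 + ["contradiction"] * 1
human = label_distribution(votes, classes)

# Two hypothetical model outputs for the same item.
collapsed = [1.0, 0.0, 0.0]   # all mass on one "correct" answer
spread = [0.5, 0.4, 0.1]      # mass spread the way the annotators split

collapsed_score = js_divergence(human, collapsed)
spread_score = js_divergence(human, spread)
print(f"collapsed: {collapsed_score:.3f}  spread: {spread_score:.3f}")
```

Under this kind of scoring, a model that puts all its mass on one answer is penalized most on exactly the items where annotators split widely, which is the failure pattern described above.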
The trouble is that chain-of-thought may not be the kind of tool that can do that. One line of work argues CoT doesn't unlock genuine new inference; it constrains the model to replay familiar reasoning shapes from training, which is why it breaks down predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). If reasoning chains are imitation of form rather than new capability, they can't supply a sensitivity the model was optimized to lose. That echoes a broader ceiling: prompting only reorganizes what's already in the training distribution and can't inject a capacity that isn't there (see "Can prompt optimization teach models knowledge they lack?"). Predicting disagreement well isn't a reasoning puzzle the model can think its way through; it's a representational property RLVR eroded.
There's also reason to think more reasoning could make things worse, not just fail to help. Verbose CoT has been shown to degrade tasks when the real bottleneck lies elsewhere: optimizing long rationales trains the wrong target while the actual signal (visual attention, in that case) goes untouched (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). Disagreement prediction looks like the same trap, because collapsing toward a confident single answer is exactly what longer deterministic reasoning encourages. And the evidence that most CoT tokens serve style rather than computation (see "Can minimal reasoning chains match full explanations?") suggests the extra reasoning isn't adding the distributional nuance you'd need anyway.
What the corpus implies is that disagreement isn't noise to be reasoned away; it's structured signal. Annotation responses decompose into genuinely different types (real preferences, non-attitudes, constructed preferences) that demand different handling rather than a single correct collapse (see "Do all annotation responses measure the same underlying thing?"), and interpretation of socially loaded sentences is irreducibly multiple across reader perspectives, not a failure of annotation (see "Why do readers interpret the same sentence so differently?"). A model whose whole objective is to converge on one answer is structurally mismatched to a target that is plural by nature. The persona-simulation literature shows a related failure: prompt-conjured 'perspectives' produce variance driven by model uncertainty rather than stable social knowledge (see "Why do LLM persona prompts produce inconsistent outputs across runs?").
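The persona finding can be read as a comparison of two quantities: the variance across repeated runs of the same persona prompt, and the variance across the mean outputs of different personas. A minimal sketch, assuming each run yields a numeric rating; the persona names and scores are made up for illustration and are not data from the cited work.

```python
from statistics import mean, pvariance

def variance_split(scores_by_persona):
    """Return (within-persona variance, across-persona variance).

    within: average variance over repeated runs of one persona prompt
    across: variance of the per-persona mean ratings
    """
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    across = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, across

# Hypothetical 1-5 ratings from four repeated runs per persona prompt.
scores_by_persona = {
    "persona_a": [2, 4, 1, 5],
    "persona_b": [3, 1, 5, 2],
    "persona_c": [4, 2, 5, 1],
}

within, across = variance_split(scores_by_persona)
noise_dominated = within >= across
print(f"within={within:.3f} across={across:.3f} noise_dominated={noise_dominated}")
```

When the within-persona figure matches or exceeds the across-persona figure, as it does in this made-up data, the spread in outputs says more about run-to-run model uncertainty than about differences between the simulated perspectives.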
The less obvious takeaway: the fix here probably isn't a better prompt at all. Confidence and convergence, the very things RLVR and reasoning chains reward (see "Does model confidence predict robustness to prompt changes?"), are liabilities when the goal is to faithfully mirror human disagreement. The capability you'd want lives upstream in the training objective, not downstream in the prompt.
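A toy calculation can illustrate why the objective matters more than the prompt. It is a supervised analogy, not an RLVR implementation: cross-entropy against a single "correct" label stands in for a deterministic-correctness signal, and cross-entropy against the annotator distribution stands in for an objective that values disagreement. All distributions below are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy in nats; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

# Hypothetical three-way item.
human = [0.5, 0.4, 0.1]      # how annotators actually split
majority = [1.0, 0.0, 0.0]   # the single "correct" label derived from them

collapsed = [0.98, 0.01, 0.01]  # confident, convergent prediction
spread = [0.5, 0.4, 0.1]        # prediction that mirrors the disagreement

# Scored against the single label, the collapsed prediction looks better.
hard_collapsed = cross_entropy(majority, collapsed)
hard_spread = cross_entropy(majority, spread)

# Scored against the annotator distribution, the ranking flips.
soft_collapsed = cross_entropy(human, collapsed)
soft_spread = cross_entropy(human, spread)

print(f"single-label target: collapsed={hard_collapsed:.3f} spread={hard_spread:.3f}")
print(f"distribution target: collapsed={soft_collapsed:.3f} spread={soft_spread:.3f}")
```

The same two predictions are ranked in opposite order depending on the target, which is the sense in which no prompt-side change can undo what a single-answer objective rewards.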
Sources (9 notes)
RLVR-trained models degrade significantly at predicting human disagreement distributions, especially when variance is high. The optimization signal for deterministic correctness actively erodes the model's ability to represent multiple valid interpretations.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.