Why do NLP benchmarks treat annotation disagreement as noise rather than signal?
This explores why standard NLP evaluation throws away disagreement between human annotators — treating it as error to average out — when that disagreement might actually be carrying real information.
The short answer the corpus offers: benchmarks are built around a 'gold label' assumption, where every example has one correct answer and any annotator who deviates is simply wrong. That assumption is convenient, since it lets you score a model with a single number, but it quietly bakes in a worldview that does not hold everywhere: not all questions actually have one answer.
The strongest pushback comes from work on interpretation. Research on socially embedded sentences finds that readers genuinely interpret the same sentence differently depending on their social position, and that this spread is *valid variation*, not sloppy labeling (see: Why do readers interpret the same sentence so differently?). When you collapse that distribution to a majority vote, you don't clean up noise; you erase a real signal about how meaning depends on who's reading. A related decomposition shows annotation responses aren't even one kind of thing: they mix genuine preferences, non-attitudes, and on-the-spot constructed preferences, each of which behaves differently and should be handled differently. Treating them all as interchangeable contaminates everything downstream, including reward-model training and alignment (see: Do all annotation responses measure the same underlying thing?).
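To make the contrast in the paragraph above concrete, here is a minimal sketch of what a majority vote discards. The annotations are invented for illustration and come from no real dataset; the function names are hypothetical. The idea is only that a label distribution and its entropy preserve the spread that a single gold label erases.

```python
from collections import Counter
from math import log2

def majority_vote(labels):
    """Collapse annotator labels to the single most common one."""
    return Counter(labels).most_common(1)[0][0]

def label_distribution(labels):
    """Keep the full spread: map each label to its share of annotators."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def entropy_bits(dist):
    """Shannon entropy of a label distribution, in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Hypothetical annotations for two sentences (illustrative only).
unanimous = ["neutral"] * 5
split = ["offensive", "offensive", "offensive", "neutral", "neutral"]

for name, labels in [("unanimous", unanimous), ("split", split)]:
    dist = label_distribution(labels)
    print(name, majority_vote(labels), dist, round(entropy_bits(dist), 3))
```

Both examples end up with a single gold label under majority voting, but only the distribution shows that two of five annotators read the second sentence differently. Whether that spread reflects valid variation or a non-attitude is a separate question that this sketch does not address.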
There's also a self-interested reason benchmarks prefer agreement: ambiguity is exactly where models look worst. When you filter out the examples annotators disagree on, which is standard practice, you remove the test cases that would expose a model's failure to recognize ambiguity at all. One study found a 32% vs. 90% accuracy gap that's completely invisible to conventional evaluation, precisely because the hard cases were filtered out before scoring (see: Do standard NLP benchmarks hide LLM ambiguity failures?). So 'disagreement as noise' isn't just a methodological habit; it's a filter that flatters the systems being measured.
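The filtering effect described above can be sketched as a score reported separately for the examples a benchmark would keep and the ones it would drop. Everything here is an assumption for illustration: the 0.8 agreement threshold, the `correct` flag (standing in for whatever scoring the task uses), and the toy data, which is not meant to reproduce the 32% vs. 90% figures.

```python
from collections import Counter

def agreement(labels):
    """Share of annotators who chose the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def accuracy(examples):
    """Mean of the per-example `correct` flags; None for an empty subset."""
    if not examples:
        return None
    return sum(ex["correct"] for ex in examples) / len(examples)

def accuracy_by_agreement(examples, threshold=0.8):
    """Score the kept (high-agreement) and dropped (low-agreement) subsets."""
    kept = [ex for ex in examples if agreement(ex["labels"]) >= threshold]
    dropped = [ex for ex in examples if agreement(ex["labels"]) < threshold]
    return {
        "kept": accuracy(kept),        # what a filtered benchmark reports
        "dropped": accuracy(dropped),  # what the filter hides
        "all": accuracy(examples),
    }

# Toy examples (hypothetical): `correct` marks whether the model was right.
toy_examples = [
    {"labels": ["A"] * 5, "correct": True},
    {"labels": ["B"] * 5, "correct": True},
    {"labels": ["A", "A", "B", "B", "C"], "correct": False},
    {"labels": ["A", "B", "B", "A", "A"], "correct": False},
]

print(accuracy_by_agreement(toy_examples))
```

Reporting the `dropped` score next to the `kept` score is the design point: the filter itself can stay, as long as what it removes is still measured.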
The lateral point worth sitting with: disagreement-as-signal echoes a pattern that shows up elsewhere in the corpus, where the *minority* carries the learning. In reasoning models, only about 20% of tokens, the high-entropy 'forking points' where the model is genuinely uncertain, drive most of the improvement; training on just those matches full training (see: Do high-entropy tokens drive reasoning model improvements?). Uncertainty itself turns out to be informative rather than disposable: calibrated uncertainty estimates beat elaborate heuristics for deciding when a model should retrieve (see: Can simple uncertainty estimates beat complex adaptive retrieval?), and measuring divergence across sampled meanings is how you catch a model confabulating (see: Can we detect when language models confabulate?). Across all of these, the thing the old paradigm wants to average away (the spread, the disagreement, the entropy) is where the actual information lives.
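As a rough illustration of the 'forking points' idea, the sketch below computes per-token entropy from next-token probability distributions and flags the top fraction of positions. The selection rule (top 20% by entropy within a sequence) is an assumption made for this example, not necessarily the procedure used in the cited work, and the probability vectors are invented. How the resulting mask would be used in training is not shown.

```python
from math import ceil, log

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

def high_entropy_mask(token_dists, fraction=0.2):
    """Flag the `fraction` of positions with the highest entropy."""
    entropies = [token_entropy(d) for d in token_dists]
    k = max(1, ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]

# Hypothetical distributions over a 3-token vocabulary at 5 positions.
toy_dists = [
    [0.98, 0.01, 0.01],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],   # near-uniform: the model is genuinely uncertain
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]

print(high_entropy_mask(toy_dists))
```

Only the near-uniform position is flagged, which is the sense in which a small minority of tokens is singled out.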
So the deeper reason benchmarks treat disagreement as noise is that they inherited an evaluation contract designed for tasks with real right answers, and never renegotiated it for tasks where human variation is the phenomenon, not the error bar. The reader curious to pull this thread will find the corpus arguing, from several directions at once, that uncertainty and disagreement are signal you can measure — not mess you should clean.
Sources (6 notes)
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; reinforcement learning with verifiable rewards (RLVR) primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
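The clustering step in the last note can be sketched as follows. This is a simplified illustration under stated assumptions: `entails` is a hypothetical predicate that would in practice be backed by an entailment model, `toy_entails` is a placeholder that only compares normalized strings, and cluster probabilities are estimated from raw counts of sampled answers.

```python
from math import log

def cluster_by_meaning(answers, entails):
    """Group answers that entail each other in both directions."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no match: start a new meaning cluster
    return clusters

def semantic_entropy(answers, entails):
    """Entropy (in nats) over meaning clusters, using count-based shares."""
    clusters = cluster_by_meaning(answers, entails)
    total = len(answers)
    return -sum((len(c) / total) * log(len(c) / total) for c in clusters)

def toy_entails(a, b):
    """Placeholder only: a real system would call an entailment model."""
    def norm(s):
        return s.strip().lower().rstrip(".")
    return norm(a) == norm(b)

# Hypothetical sampled answers to the same question.
consistent = ["Paris", "paris.", "Paris "]
scattered = ["Paris", "Lyon", "Nice"]

print(semantic_entropy(consistent, toy_entails))
print(semantic_entropy(scattered, toy_entails))
```

Answers that differ only in surface form fall into one cluster and yield zero entropy, while answers that disagree in meaning spread across clusters and yield high entropy, which is the signal used to flag a likely confabulation.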