Do high-disagreement items signal contested values or measurement noise?

This explores whether annotator disagreement on an item means people genuinely hold different values, or just that the measurement is unreliable — and how you'd tell the two apart.

The question sounds like a binary (is disagreement signal or noise?), but the corpus reframes it: disagreement is itself a *mixture*, and the real skill is separating the strands rather than picking a side. The most direct handle comes from work showing that annotation responses don't all measure the same thing: they decompose into genuine preferences, non-attitudes (people who don't really have a view but answer anyway), and constructed preferences (made up on the spot). Crucially, these are distinguishable by whether a person answers consistently across different measurement conditions (see "Do all annotation responses measure the same underlying thing?"). That gives you the test: noise wobbles when you re-ask the same thing differently; contested values stay stable but point in opposite directions.
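The test can be sketched in a few lines. This is a minimal illustration, not a method taken from the cited notes: the function names, the data layout (each annotator's answers to the same item under several framings), and both thresholds are assumptions chosen for the example.

```python
from collections import Counter


def modal_share(answers):
    """Fraction of answers that equal the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def diagnose_item(responses, stable_at=0.8, split_below=0.7):
    """Label one item as 'noise', 'contested', or 'consensus'.

    responses: dict mapping annotator id -> list of labels that annotator
    gave when the same item was re-asked under different conditions.
    stable_at and split_below are illustrative thresholds (assumptions).
    """
    # Within-annotator stability: do people repeat themselves when re-asked?
    stability = sum(modal_share(a) for a in responses.values()) / len(responses)
    if stability < stable_at:
        return "noise"  # answers wobble across conditions

    # Between-annotator agreement on each person's most frequent answer.
    modal_answers = [Counter(a).most_common(1)[0][0] for a in responses.values()]
    agreement = modal_share(modal_answers)
    # Stable individuals who point in different directions: contested values.
    return "contested" if agreement < split_below else "consensus"


wobbly = {"a": ["yes", "no", "yes"], "b": ["no", "yes", "no"], "c": ["yes", "no", "no"]}
split = {"a": ["yes"] * 3, "b": ["no"] * 3, "c": ["yes"] * 3, "d": ["no"] * 3}
print(diagnose_item(wobbly), diagnose_item(split))
```

Both toy items would show high disagreement if each annotator were asked only once; the re-asks are what separate them.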

Once you apply that test, much of the high disagreement falls into the 'noise' bucket, but not all of it, and the leftover is where things get interesting. Research on interpretation modeling argues that disagreement on socially embedded sentences is *irreducibly* multiple: readers in different social positions genuinely read the same sentence differently, and the spread of interpretations carries real information rather than marking a labeling failure (see "Why do readers interpret the same sentence so differently?"). Here the disagreement is the data. The same effect appears in persuasion research, where what a reader already believes predicts the outcome better than anything the speaker says, meaning 'disagreement' is partly just audiences with different priors talking past the linguistic content (see "Does what readers believe matter more than what debaters say?").

The sharpest stakes show up downstream, in how we build reward models. A single aggregated reward model cannot represent a 51-49 split: it has to either leave 49% of people unhappy all the time or leave everyone unhappy half the time. That is a *representational* failure, not a quality problem (see "Can aggregate reward models satisfy genuinely disagreeing users?"). So if you mistake contested values for noise and average them away, you are not cleaning the data; you are structurally erasing a minority. That's the cost of guessing wrong about which kind of disagreement you have.
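The arithmetic behind the 51-49 claim is easy to check. The sketch below assumes a deliberately simplified setup that is not spelled out in the source: two candidate outputs A and B, a majority that prefers A, a user who is "unhappy" whenever the other output is served, and one shared policy for everyone. The function name and parameters are hypothetical.

```python
def unhappy_rates(p_majority, prob_serve_a):
    """Return (majority_unhappy, minority_unhappy, overall_unhappy).

    p_majority: fraction of users who prefer output A.
    prob_serve_a: probability that the single shared policy serves A.
    """
    majority_unhappy = 1 - prob_serve_a  # A-preferrers lose when B is served
    minority_unhappy = prob_serve_a      # B-preferrers lose when A is served
    overall = p_majority * majority_unhappy + (1 - p_majority) * minority_unhappy
    return majority_unhappy, minority_unhappy, overall


# Always serve the majority's choice: the minority is unhappy every time.
always_a = unhappy_rates(0.51, 1.0)

# Flip a fair coin between A and B: every user is unhappy half the time.
coin_flip = unhappy_rates(0.51, 0.5)

print(always_a, coin_flip)
```

No value of `prob_serve_a` brings both groups' unhappiness to zero, because the two rates always sum to one. Under these assumptions the limitation comes from having a single policy, not from how well that policy is trained.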

The noise side has its own subtlety worth borrowing. Two methods normally used for model uncertainty double as noise detectors for annotation. The first tests whether 'consistent' outputs are actually *reliable*, or just one draw from the underlying distribution repeated over and over (see "Does setting temperature to zero actually make LLM outputs reliable?"). The second clusters responses by meaning to measure how much the *semantics* (not the surface wording) diverge (see "Can we detect when language models confabulate?"). Both formalize the same intuition behind the consistency test above: re-sample under varied conditions and watch what holds.
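The clustering idea can be sketched as entropy over meaning clusters. This is an illustration under stated assumptions rather than the cited method's implementation: the source note describes clustering by bidirectional entailment, whereas `toy_same_meaning` below is a hypothetical stand-in that only normalizes case and punctuation. In practice an entailment model would take its place.

```python
import math


def semantic_clusters(answers, same_meaning):
    """Greedily group answers; each answer joins the first cluster whose
    representative it matches according to same_meaning(a, b)."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no match found: start a new cluster
    return clusters


def semantic_entropy(answers, same_meaning):
    """Shannon entropy (in nats) over the share of answers in each cluster."""
    n = len(answers)
    shares = [len(c) / n for c in semantic_clusters(answers, same_meaning)]
    return -sum(p * math.log(p) for p in shares)


def toy_same_meaning(a, b):
    """Stand-in equivalence check (assumption): ignore case and punctuation."""
    def normalize(s):
        return s.lower().strip(" .!?")
    return normalize(a) == normalize(b)


surface_only = ["Paris", "paris.", "PARIS"]  # different wording, one meaning
real_split = ["yes", "no", "Yes.", "No!"]    # two meanings, evenly divided

print(semantic_entropy(surface_only, toy_same_meaning))
print(semantic_entropy(real_split, toy_same_meaning))
```

Answers that differ only on the surface collapse into one cluster and score zero, while an even split across two meanings scores ln 2. Applied to annotation, the same measure could be computed over re-asked responses or free-text rationales for a single item.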

The thing you didn't know you wanted to know: the question isn't 'values or noise' — it's whether your *measurement design* even lets you tell them apart. If you only ask once, in one framing, the two are indistinguishable by construction, and any answer you give about contested values is itself an artifact of lazy measurement.

Sources (6 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Does what readers believe matter more than what debaters say?

Analysis of debate corpora shows that political and religious ideology labels of voters outpredict linguistic features when modeling debate outcomes. Language effects observed without reader controls are confounded by audience composition correlated with debate topics.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
