How much noise comes from rater idiosyncrasy versus selection bias?
This explores the two very different kinds of error in evaluation data: rater idiosyncrasy (random scatter from individual human judgment) versus selection bias (systematic distortion from which examples ever get seen or labeled). The corpus doesn't put a single number on the ratio, but it makes a sharper point — these two noise sources behave so differently that lumping them together as "noise" hides the real problem. Idiosyncratic rater error is roughly uncorrelated across people, so it averages out; selection bias is correlated and structural, so it compounds.
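The "averages out versus compounds" contrast can be made concrete with a standard variance identity. This is a sketch under assumptions not stated in the notes: each of $n$ raters contributes an error $e_i$ with the same variance $\sigma^2$, and every pair of raters has the same error correlation $\rho$.

$$
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)
= \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2
\;\xrightarrow[\;n\to\infty\;]{}\; \rho\,\sigma^2
$$

With $\rho = 0$ (pure idiosyncrasy) the error of the average shrinks as $\sigma^2/n$. With any $\rho > 0$ (a shared cause) it hits a floor of $\rho\,\sigma^2$ that adding raters cannot lower.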
The clearest illustration of why idiosyncrasy is the *tractable* kind comes from work on training across many imperfect experts ("Can models trained on many imperfect experts outperform each one?"). When you aggregate many raters or experts whose mistakes point in different directions, cross-entropy optimization effectively takes an implicit majority vote and denoises the uncorrelated individual errors, so the consensus can outperform any single rater. That only works because the errors are independent. The moment errors share a common cause, averaging stops helping. And a lot of what looks like "rater" variation is actually a shared prior: cognitive biases in models (and arguably in human labelers too) are planted upstream and merely nudged later ("Where do cognitive biases in language models come from?"), meaning some apparent idiosyncrasy is really correlated bias wearing a disguise.
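A minimal sketch of why a majority vote helps only when errors are independent. It uses an explicit vote over binary judgments, not the implicit vote the cited work attributes to cross-entropy training. The function names, the 70% rater accuracy, and the 10% shared-error rate are all hypothetical choices made for illustration.

```python
from math import comb


def majority_accuracy(n_raters: int, p_correct: float) -> float:
    """Exact probability that a strict majority of n independent raters is right."""
    if n_raters % 2 == 0:
        raise ValueError("use an odd number of raters to avoid ties")
    needed = n_raters // 2 + 1
    return sum(
        comb(n_raters, k) * p_correct**k * (1 - p_correct) ** (n_raters - k)
        for k in range(needed, n_raters + 1)
    )


def majority_accuracy_with_shared_error(
    n_raters: int, p_correct: float, shared_error_rate: float
) -> float:
    """Toy correlated model (an assumption): on a fraction of items every rater
    is wrong together; on the remaining items raters err independently."""
    return (1 - shared_error_rate) * majority_accuracy(n_raters, p_correct)


if __name__ == "__main__":
    for n in (1, 5, 25, 101):
        independent = majority_accuracy(n, 0.7)
        correlated = majority_accuracy_with_shared_error(n, 0.7, 0.1)
        print(f"n={n:3d}  independent={independent:.3f}  with shared error={correlated:.3f}")
```

In this toy model the independent column climbs toward 1.0 as raters are added, while the shared-error column can never exceed 0.9, however large the panel.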
Selection bias is the dangerous one because it doesn't wash out; it feeds back. YouTube's ranking work argues you have to model selection bias *explicitly*, with a dedicated mechanism, or the system converges on degenerate equilibria that amplify its own past decisions ("Why do ranking systems need to model selection bias explicitly?"). The data you collect is shaped by what the model previously surfaced, so the bias isn't random scatter you can sample your way out of; it's a loop that gets stronger over time. No number of additional raters fixes a sampling process that systematically never shows you the cases where you're wrong.
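The feedback loop can be sketched with a two-item toy ranker. This is not YouTube's position tower: it swaps in a simpler correction that divides observed click-through rate by a known examination probability per slot, and every number and name below is hypothetical. Expected values are used instead of sampled clicks so the run is deterministic.

```python
# Hypothetical toy values; none of these numbers come from the source.
TRUE_RELEVANCE = {"A": 0.5, "B": 0.6}  # B is actually the better item
EXAMINE_PROB = [1.0, 0.5]  # chance a user even looks at slot 0 / slot 1


def expected_ctr(ranking):
    """Expected observed click-through rate per item, given the slot it was shown in."""
    return {
        item: EXAMINE_PROB[slot] * TRUE_RELEVANCE[item]
        for slot, item in enumerate(ranking)
    }


def rerank(ranking, correct_for_position):
    """Produce the next ranking from the clicks the current ranking generated."""
    ctr = expected_ctr(ranking)
    if correct_for_position:
        # Divide out the examination probability of the slot each item occupied.
        score = {item: ctr[item] / EXAMINE_PROB[slot] for slot, item in enumerate(ranking)}
    else:
        # Naive: treat observed CTR as if it were relevance.
        score = ctr
    return sorted(ranking, key=lambda item: score[item], reverse=True)


def run(initial, rounds, correct_for_position):
    ranking = list(initial)
    for _ in range(rounds):
        ranking = rerank(ranking, correct_for_position)
    return ranking


if __name__ == "__main__":
    print("naive, A first:    ", run(["A", "B"], 10, correct_for_position=False))
    print("naive, B first:    ", run(["B", "A"], 10, correct_for_position=False))
    print("corrected, A first:", run(["A", "B"], 10, correct_for_position=True))
```

The naive ranker keeps whichever item it happened to start with on top, because the top slot manufactures the clicks that justify it. The corrected ranker recovers the better item from either starting point, but only because the examination probabilities are assumed known here.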
This connects to a quieter failure mode: errors that *concentrate* rather than scatter. Fluent, confident, wrong answers cluster precisely in the rare cases where harm occurs, and aggregate accuracy masks them because overall performance still looks strong ("Why do confident wrong answers hide in standard accuracy metrics?"). That's selection bias at the metric level: your evaluation set under-samples exactly the region where the model fails. Even your measurement of "reliability" can be fooled. A deterministic, zero-temperature output is perfectly consistent yet still just one draw from a distribution ("Does setting temperature to zero actually make LLM outputs reliable?"), so low rater variance can give false comfort that the underlying judgment is sound.
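A small arithmetic sketch of how aggregate accuracy can hide a concentrated failure. The slice names and counts are invented for illustration and are not taken from the cited notes.

```python
from collections import defaultdict


def overall_accuracy(records):
    """records: list of (slice_name, is_correct) pairs."""
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_slice(records):
    """Accuracy computed separately for each slice name."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical evaluation set: 990 common cases and 10 rare, high-harm cases.
records = (
    [("common", True)] * 985
    + [("common", False)] * 5
    + [("rare_high_harm", True)] * 2
    + [("rare_high_harm", False)] * 8
)

if __name__ == "__main__":
    print(f"overall: {overall_accuracy(records):.3f}")
    for name, acc in accuracy_by_slice(records).items():
        print(f"{name}: {acc:.3f}")
```

With these made-up counts the headline figure is 98.7% while the rare slice sits at 20%. Reporting accuracy per slice is one simple way to keep the under-sampled region visible.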
The practical upshot, if you're trying to clean up an evaluation pipeline: idiosyncratic rater noise is the cheap problem (add raters, aggregate, denoise). Selection bias is the expensive one, and it has to be designed against structurally, not sampled against. Stronger judging machinery helps with consistency: agentic evaluators with evidence collection cut judge instability dramatically ("Can agents evaluate AI outputs more reliably than language models?"). But a more reliable judge applied to a biased sample just reliably measures the wrong thing. The point worth taking away: chasing rater agreement can make your numbers look better while the bias that actually matters sits untouched in what you chose to measure.
Sources (6 notes)
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.