What does McDonald's omega reveal about LLM judgment consistency?
This explores what a psychometric reliability statistic (McDonald's omega, borrowed from how psychologists measure test consistency) tells us when applied to repeated LLM judgments — and the corpus's answer is that it exposes a gap between getting the same answer twice and getting a trustworthy one.
The sharpest finding in the corpus is a warning about what consistency actually buys you. When you set temperature to zero or fix a random seed, the model dutifully repeats the same output, but omega testing across 100 repetitions shows that this reproducibility is not the same thing as reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). A deterministic setting just keeps redrawing the same single sample from the model's probability distribution; it pins down the output without making that output any more correct or representative. Stable does not mean right.
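For readers who have not met the statistic, the standard single-factor form of McDonald's omega is written out below. The mapping onto LLM judging is an assumption made here for illustration: each repetition of the judge plays the role of a test item, and each judged input plays the role of a respondent. The source does not spell out how the cited 100-repetition test was set up.

```latex
\omega \;=\; \frac{\left(\sum_{i=1}^{k} \lambda_i\right)^{2}}
                  {\left(\sum_{i=1}^{k} \lambda_i\right)^{2} + \sum_{i=1}^{k} \psi_i}
% \lambda_i : loading of repetition i on the single common factor
% \psi_i    : unique (error) variance of repetition i
% k         : number of repetitions (assumed here to be the 100 runs mentioned above)
```

The numerator is the variance the repetitions share and the denominator adds the variance unique to each one. The formula contains no ground-truth term, so a high value says the repetitions move together, not that they are correct.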
The reason this matters becomes clear once you look at what's underneath the surface. An LLM doesn't hold one fixed view. It maintains a kind of superposition over many plausible 'characters,' and each response samples from that spread (see "Does an LLM commit to a single character or maintain many?"). So when you raise temperature above zero and run the same prompt repeatedly, the variation you see is the model's genuine uncertainty leaking out. Studies of persona prompting make this vivid: rerun the same persona prompt and the output variance across runs can match or exceed the variance across entirely different personas, meaning the model's own uncertainty, not stable knowledge, is driving the answers (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). Omega is the instrument that quantifies exactly this: it separates 'the model keeps saying the same thing' from 'the model actually knows the thing.'
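The persona comparison above amounts to a within-group versus between-group variance check. A minimal sketch follows, assuming numeric scores have already been collected for each persona over several runs. The function name `variance_split` and the toy numbers are hypothetical and not taken from the cited studies.

```python
from statistics import mean, pvariance


def variance_split(runs_by_persona):
    """Split score variance into run-to-run and persona-to-persona parts.

    runs_by_persona maps a persona label to the list of scores obtained
    by rerunning that same persona prompt.
    Returns (within, between):
      within  = average variance across reruns of one persona
      between = variance of the per-persona mean scores
    """
    groups = list(runs_by_persona.values())
    within = mean([pvariance(runs) for runs in groups])
    between = pvariance([mean(runs) for runs in groups])
    return within, between


# Toy, made-up scores purely to show the call shape.
toy = {"persona_a": [1.0, 3.0], "persona_b": [2.0, 4.0]}
within, between = variance_split(toy)
noise_dominates = within >= between  # reruns differ at least as much as personas do
```

When `within` is at least as large as `between`, rerunning one persona moves the output as much as switching personas does, which is the pattern the paragraph above describes.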
The deeper lesson is that consistency can be manufactured cheaply and reliability cannot. You can force agreement by freezing the sampler, but that hides the instability rather than fixing it. This is why some judge designs go the other direction entirely: instead of suppressing variance, they let the model express uncertainty and abstain. Personalized LLM judges fail badly on sparse user information, but verbal uncertainty estimation recovers reliability above 80% on the cases the model is confident about, precisely by letting it decline the low-confidence ones (see "Why do LLM judges fail at predicting sparse user preferences?"). The honest signal lives in the spread, not in a single frozen draw.
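The abstention idea can be sketched as selective evaluation: score the judge only on the cases where its stated confidence clears a threshold, and report how many cases that leaves. This is a sketch under assumptions, not the cited method. The `(confidence, is_correct)` record format, the function name `selective_accuracy`, and the threshold value are all hypothetical.

```python
def selective_accuracy(records, threshold):
    """Evaluate a judge that abstains below a confidence threshold.

    records: iterable of (confidence, is_correct) pairs, where confidence
             is the judge's self-reported certainty in [0, 1] (assumed format).
    Returns (coverage, accuracy_on_kept); accuracy is None if nothing is kept.
    """
    records = list(records)
    kept = [bool(ok) for conf, ok in records if conf >= threshold]
    coverage = len(kept) / len(records) if records else 0.0
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


# Made-up records purely to show the trade-off between coverage and accuracy.
toy_records = [(0.9, True), (0.8, True), (0.4, False), (0.3, True)]
forced = selective_accuracy(toy_records, threshold=0.0)     # judge must answer everything
abstaining = selective_accuracy(toy_records, threshold=0.7)  # judge may decline
```

Reporting coverage next to accuracy matters here: a higher accuracy on the kept cases is only meaningful alongside the share of cases the judge declined.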
There's a cross-domain echo worth pulling in: variation across regenerations isn't just noise to be minimized; it can be diagnostic. Shanahan's framework distinguishes fabrication (high variation), good-faith error (low, stable variation), and role-played deception (low variation but context-dependent) using nothing but these behavioral regeneration patterns (see "Can we distinguish types of LLM falsehood by regeneration patterns?"). That reframes the whole exercise: the omega-style act of running the same input many times and watching the distribution is a window into *what kind* of answer you're getting, not merely how reproducible it is.
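The three regeneration patterns can be turned into a rough triage rule. The sketch below is an illustrative assumption rather than the framework's own procedure. It presupposes that the answer is already known to be false, it measures variation as disagreement with the most common answer, and the 0.5 cutoff and the function names are hypothetical.

```python
from collections import Counter


def disagreement_rate(answers):
    """Share of regenerations that differ from the most common answer."""
    modal_count = Counter(answers).most_common(1)[0][1]
    return 1 - modal_count / len(answers)


def triage(answers_by_context, high_variation=0.5):
    """Label the regeneration pattern of an answer already known to be false.

    answers_by_context maps a context label (e.g. a prompt framing) to the
    list of answers produced by regenerating under that context.
    high_variation is an assumed cutoff, not a value from the source.
    """
    rates = [disagreement_rate(a) for a in answers_by_context.values()]
    if max(rates) >= high_variation:
        return "high variation (fabrication-like)"
    modal_answers = {
        Counter(a).most_common(1)[0][0] for a in answers_by_context.values()
    }
    if len(modal_answers) == 1:
        return "low variation, stable (good-faith-error-like)"
    return "low variation, context-dependent (role-play-like)"
```

For example, `triage({"plain": ["A"] * 4, "in_character": ["B"] * 4})` returns the context-dependent label, because each context is internally stable but the stable answer changes between them.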
The thing you didn't know you wanted to know: the most reliable-looking LLM setting (temperature zero, fixed seed, identical outputs every time) is in a real sense the *least* informative, because it throws away the very variance that would have told you whether to trust the answer. If you want a judge you can rely on, measure the spread before you suppress it; and consider building judges that reason through evaluations rather than reflexively answer, since reasoning-trained judges resist the surface biases that make a confident-but-shallow verdict look stable (see "Can reasoning during evaluation reduce judgment bias in LLM judges?").
Sources (6 notes)
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Sparse persona information lacks predictive power for specific preferences, causing LLM judges to fail. Verbal uncertainty estimation recovers reliability above 80% on high-certainty samples by allowing abstention rather than forced judgment.
Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.
Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.