Why is the Judging preference constant while other traits vary slightly?

This explores why one measured trait (a 'Judging' preference) holds steady while neighboring traits drift a little — really a question about what makes some preferences stable across repeated measurement and others not.

This reads as a question about stability: when you measure a set of traits repeatedly and one stays locked while the rest wobble, what separates the anchor from the drifters? The corpus has no paper named for the Judging dimension, but it offers a direct answer under different vocabulary: the difference between a *genuine* preference and a constructed one. Behavioral-science work on annotation finds that responses decompose into at least three signal types (genuine preferences, non-attitudes, and on-the-spot constructed preferences), and that these are told apart by their consistency across measurement conditions (see: Do all annotation responses measure the same underlying thing?). A trait that barely moves is behaving like a genuine preference; the ones that shift a little are likely picking up non-attitude or construction noise. The constant trait isn't more 'important'; it is the one the underlying system actually holds an attitude about.

The drift in the other traits has well-documented sources. The same rater gives the same item noticeably different scores across sessions, with shifts of multiple stars, driven by temporal inconsistency, anchoring, and rater-specific style rather than any change in true preference (see: Why do the same users rate items differently each time?). So small variation in most traits is the expected baseline; what is unusual and worth explaining is the trait that *refuses* to move. That refusal often tracks confidence: when a model is highly confident, its output resists rephrasing and perturbation, while low confidence produces large swings (see: Does model confidence predict robustness to prompt changes?). On this reading, a rock-steady Judging score is consistent with a high-confidence region of the distribution, while the slightly varying traits sit nearer the model's decision boundaries.

There's a cautionary cross-current, though. Constancy can be manufactured rather than meaningful. Fixed seeds and zero temperature reproduce the same output every run, yet that output is still just one draw from the distribution: consistency does not equal reliability (see: Does setting temperature to zero actually make LLM outputs reliable?). If the Judging preference is constant because the measurement procedure pins it (a deterministic setting, a leading prompt, a sparse persona that always defaults the same way), the stability tells you about your instrument, not the trait. LLM-as-judge work shows exactly this trap: with sparse persona information the judge fails, and the fix is letting it abstain rather than forcing a confident-looking answer (see: Why do LLM judges fail at predicting sparse user preferences?).
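The gap between consistency and reliability can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the answer distribution `ANSWER_PROBS` is invented, and `greedy_answer` merely stands in for zero-temperature decoding by always taking the most probable label. It is not a model of any particular system.

```python
import random
from collections import Counter

# Hypothetical answer distribution for one trait question (illustrative only).
ANSWER_PROBS = {"Judging": 0.40, "Perceiving": 0.35, "Undecided": 0.25}

def greedy_answer(probs):
    """Stand-in for zero-temperature decoding: always pick the top label."""
    return max(probs, key=probs.get)

def sampled_answer(probs, rng):
    """Draw one label in proportion to its probability."""
    labels = list(probs)
    weights = [probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
greedy_runs = Counter(greedy_answer(ANSWER_PROBS) for _ in range(100))
sampled_runs = Counter(sampled_answer(ANSWER_PROBS, rng) for _ in range(100))

print("greedy :", dict(greedy_runs))   # one label, repeated 100 times
print("sampled:", dict(sampled_runs))  # spread across several labels
```

The greedy runs agree with each other perfectly even though the top label holds only 40% of the probability mass in this made-up distribution. Perfect agreement across repetitions therefore says little about how firmly the answer is held.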

And context matters more than a single stable number suggests. Perceived personality from speech flips by situation: acoustic cues that signal extraversion in a neutral interview instead read as neuroticism under stress (see: Does personality sound the same in stressful and neutral conversations?). A trait that looks constant in one elicitation context may not survive a change of setting, so 'constant' is always relative to the conditions you tested. The thing you didn't know you wanted to know: a perfectly stable trait is genuinely ambiguous evidence. It is the fingerprint of a real, confidently held preference *and* the fingerprint of a measurement that can't move. The only way to tell which is to vary the conditions and see whether the constancy holds.
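One way to act on that last point is to score each trait under several elicitation conditions and compare the spread within a condition to the shift between conditions. The following is a minimal sketch under stated assumptions: the function `classify_stability`, the `tolerance` threshold, the condition names, and all scores are hypothetical and chosen only to show the three cases.

```python
from statistics import mean, pstdev

def classify_stability(scores, tolerance=0.5):
    """Label one trait given {condition name: [repeated scores]}.

    `tolerance` is an arbitrary illustrative threshold, not a standard value.
    """
    # Largest run-to-run spread inside any single condition.
    within = max(pstdev(runs) for runs in scores.values())
    # Gap between the highest and lowest per-condition average.
    condition_means = [mean(runs) for runs in scores.values()]
    across = max(condition_means) - min(condition_means)

    if within > tolerance:
        return "noisy"
    if across > tolerance:
        return "context-dependent"
    return "stable across tested conditions"

# Hypothetical scores on a 1-5 scale, three repeats per condition.
judging = {"neutral": [4, 4, 4], "stress": [4, 4, 4], "rephrased": [4, 4, 4]}
extraversion = {"neutral": [4, 4, 4], "stress": [2, 2, 2], "rephrased": [4, 4, 4]}
openness = {"neutral": [2, 4, 3], "stress": [5, 3, 1], "rephrased": [3, 3, 4]}

for name, scores in [("judging", judging),
                     ("extraversion", extraversion),
                     ("openness", openness)]:
    print(name, "->", classify_stability(scores))
```

Even the most favorable label only covers the conditions that were actually tested, which is why it reads "stable across tested conditions" rather than simply "stable".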

Sources (6 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do the same users rate items differently each time?

Amatriain et al. found that the same user gives substantially different ratings to the same item across sessions, shifting by multiple stars. This noise stems from temporal inconsistency, rater-specific biases, and anchoring effects—making ratings reflect both preference and rating-behavior rather than stable preference alone.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Why do LLM judges fail at predicting sparse user preferences?

Sparse persona information lacks predictive power for specific preferences, causing LLM judges to fail. Verbal uncertainty estimation recovers reliability above 80% on high-certainty samples by allowing abstention rather than forced judgment.

Does personality sound the same in stressful and neutral conversations?

Acoustic features that signal extraversion in neutral interviews instead predict neuroticism under stress. Handcrafted acoustic features outperform neural embeddings, suggesting personality is conveyed through specific measurable behaviors rather than holistic speaker style.
