How does unidimensionality in assessments affect measurement validity?
This explores what goes wrong when an assessment collapses something genuinely multi-dimensional into a single score or signal — and why that flattening, not bad measurement technique, is often where validity breaks.
The corpus circles this from several directions, and the recurring lesson is that the damage happens upstream of the math: the moment you decide a complex phenomenon fits on a single axis, you have already discarded the information that validity depends on.
The clearest case is in annotation and feedback. Human ratings don't all measure the same underlying thing: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, which only separate out when you vary the measurement conditions (see "Do all annotation responses measure the same underlying thing?"). Treat them as one uniform signal and you contaminate reward model training and everything downstream of it. The same shape shows up in agent feedback, which carries two orthogonal channels, evaluative (how good was that?) and directive (how should it change?), that a single scalar reward simply cannot hold at once (see "Can scalar rewards capture all the information in agent feedback?"). Unidimensionality here isn't a simplification; it's a deletion.
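The deletion can be made concrete with a minimal sketch. The `Feedback` class, its field names, the 0.0 to 1.0 scale, and the two example messages are all hypothetical, invented here for illustration rather than taken from any of the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for one piece of feedback on an agent action."""
    evaluative: float  # how good the action was (assumed scale: 0.0 to 1.0)
    directive: str     # how the action should change


def to_scalar_reward(feedback: Feedback) -> float:
    """Collapse feedback to a single number; the directive channel is dropped."""
    return feedback.evaluative


# Two made-up pieces of feedback: same evaluation, different requested changes.
a = Feedback(evaluative=0.4, directive="cite the source for the second claim")
b = Feedback(evaluative=0.4, directive="shorten the answer to one paragraph")

same_scalar = to_scalar_reward(a) == to_scalar_reward(b)
same_feedback = a == b
print(same_scalar, same_feedback)
```

Once both messages have been reduced to the same scalar, nothing downstream can tell them apart, even though they ask for different changes.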
Where researchers build assessments deliberately, they tend to refuse the single axis. Prompt quality resolves into six dimensions grounded in communication theory, not a flat checklist (see "Can we measure prompt quality independent of model outputs?"). Social intelligence needs seven simultaneous dimensions, because scoring only goal-achievement misses believability, relationship, and social rules entirely (see "Can social intelligence be measured across seven dimensions?"). And alignment turns out to be several non-interchangeable things: lexical alignment buys task efficiency while emotional and prosodic alignment buy trust, so collapsing them produces category errors like a cold support bot that scored 'aligned' (see "Do different types of alignment serve different conversational goals?"). A high score on the wrong single dimension is worse than no score, because it looks valid.
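A small sketch shows how an averaged score can hide exactly the dimensions that matter. The dimension names follow the seven listed for social intelligence in the sources. The scores, the 0 to 10 scale, the threshold, and the helper names `collapse` and `weakest` are assumptions made up for this example.

```python
from statistics import mean

# Dimension names follow the seven listed in the sources; every score below
# is invented for illustration (assumed scale: 0 to 10).
DIMENSIONS = ["goal", "believability", "knowledge", "secret",
              "relationship", "social_rules", "financial"]

agent_a = dict(zip(DIMENSIONS, [9, 2, 6, 6, 2, 3, 7]))
agent_b = dict(zip(DIMENSIONS, [5, 5, 5, 5, 5, 5, 5]))


def collapse(profile):
    """Flatten a profile into one number by averaging its dimensions."""
    return mean(profile.values())


def weakest(profile, threshold=4):
    """Return the dimensions scoring below the (hypothetical) threshold."""
    return sorted(d for d, s in profile.items() if s < threshold)


print(collapse(agent_a) == collapse(agent_b))  # identical single scores
print(weakest(agent_a))                        # failures the average hides
print(weakest(agent_b))
```

Both agents collapse to the same number, yet only the profile reveals that the first one achieves its goal while failing on believability, relationship, and social rules.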
There's a subtler failure too: a unidimensional metric can be perfectly consistent and still invalid. Fixed seeds and zero temperature reproduce the same output every time, but that repeatability isn't reliability: the output is still one draw from the model's probability distribution, and McDonald's omega computed across repetitions exposes the gap (see "Does setting temperature to zero actually make LLM outputs reliable?"). Imitation models exploit exactly this: they nail the single dimension a human evaluator eyeballs, a confident and fluent style, while closing no actual capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). If your assessment only reads one axis, anything that optimizes that axis will fool it.
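For readers unfamiliar with the statistic, here is a minimal sketch of McDonald's omega under a one-factor model with standardized loadings and uncorrelated errors. It assumes the loadings were already estimated elsewhere (for example by a factor analysis of scores collected across repeated runs); the function name `mcdonald_omega` and the example loadings are illustrative and not taken from the cited note.

```python
def mcdonald_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    Assumes uncorrelated errors, so each item's unique variance
    is 1 - loading**2.
    """
    if not loadings:
        raise ValueError("need at least one loading")
    common = sum(loadings) ** 2                       # variance due to the factor
    unique = sum(1.0 - lam ** 2 for lam in loadings)  # item-specific variance
    return common / (common + unique)


# Made-up loadings: items that track one construct well vs. poorly.
strong = mcdonald_omega([0.8, 0.8, 0.8, 0.8])
weak = mcdonald_omega([0.3, 0.3, 0.3, 0.3])
print(round(strong, 3), round(weak, 3))
```

Omega rises only when the repeated measurements share variance from a common factor, which is a different property from a single run reproducing itself exactly.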
The payoff the corpus hints at is that you can have rigorous single-number validity once you've earned the dimensions first: LLEAP reaches an omega of 0.953 rating therapy engagement precisely because it builds the construct properly before scoring it (see "Can local language models rate therapy engagement reliably?"). And the design move that protects validity is keeping distinct dimensions categorical rather than mashing them into one continuous reward: rubrics used as accept/reject gates resist gaming, whereas rubrics flattened into dense scores get hacked (see "Can rubrics and dense rewards work together without hacking?"). The takeaway: unidimensionality rarely fails by being inaccurate; it fails by being confidently precise about the wrong quantity.
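The gate-versus-dense distinction can be sketched in a few lines. This is a simplified per-answer illustration, not the DRO procedure itself (which the sources describe as accepting or rejecting rollout groups); the toy rubrics, the example answers, and the names `passes_gate`, `dense_score`, and `gated_reward` are all hypothetical.

```python
from typing import Callable, Optional, Sequence

Rubric = Callable[[str], bool]  # hypothetical: True when a criterion is met

# Toy string checks standing in for real rubric criteria.
RUBRICS: Sequence[Rubric] = (
    lambda a: "42" in a,             # states the answer
    lambda a: "because" in a,        # gives a reason
    lambda a: len(a.split()) <= 30,  # stays concise
)


def dense_score(answer: str, rubrics: Sequence[Rubric]) -> float:
    """Flatten the rubric into one number: the fraction of criteria met."""
    return sum(r(answer) for r in rubrics) / len(rubrics)


def passes_gate(answer: str, rubrics: Sequence[Rubric]) -> bool:
    """Categorical use of the rubric: accept only if every criterion is met."""
    return all(r(answer) for r in rubrics)


def gated_reward(answer: str, rubrics: Sequence[Rubric],
                 inner_reward: Callable[[str], float]) -> Optional[float]:
    """Reject outright (None) or score accepted answers with a finer reward."""
    return inner_reward(answer) if passes_gate(answer, rubrics) else None


good = "The answer is 42 because 6 times 7 is 42."
gamed = "This is clearly correct because the reasoning is sound."

print(dense_score(gamed, RUBRICS), passes_gate(gamed, RUBRICS))
print(dense_score(good, RUBRICS), passes_gate(good, RUBRICS))
```

The gamed answer never states a result but still earns partial credit from the dense score, whereas the gate rejects it before any finer-grained reward gets a chance to reinforce it.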
Sources (9 notes)
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
The SOTOPIA framework operationalizes social intelligence across Goal, Believability, Knowledge, Secret, Relationship, Social Rules, and Financial dimensions. Humans produce 16.8 words per turn versus GPT-4's 45.5, revealing efficiency as a measurable capability in social interaction.
A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
LLEAP achieved reliability (omega=0.953) and valid correlations with motivation, effort, and symptom outcomes using Llama 3.1 8B to rate 1,131 therapy sessions, while keeping data locally stored.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.