Can aggregate survey realism coexist with unreliable fine-grained effects?
This note explores whether LLMs can reproduce believable population-level survey patterns while still getting individual or fine-grained effects wrong, and whether those two facts can be true at once.
The corpus says yes, emphatically, and even explains the mechanism. The clearest statement comes from work on causal simulation: LLMs guided by structural causal models recover effect *directions* reliably but not effect *magnitudes* (Can structural causal models automate social science with language models?). That is exactly the split the question names: the coarse shape of the aggregate looks right, while the precise size of any one effect is untrustworthy. Directional social science survives; point estimates don't.
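The direction-versus-magnitude split can be checked with two separate metrics. The sketch below is only an illustration: the effect sizes are invented toy numbers, not results from any study, and the function names are hypothetical.

```python
# Toy comparison of simulated vs. human effect estimates.
# All numbers are invented for illustration; they are not from any study.

def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def direction_agreement(simulated, human):
    """Share of effects where the simulation gets the sign right."""
    matches = sum(sign(s) == sign(h) for s, h in zip(simulated, human))
    return matches / len(human)

def mean_relative_magnitude_error(simulated, human):
    """Average of |simulated - human| / |human| over nonzero human effects."""
    errors = [abs(s - h) / abs(h) for s, h in zip(simulated, human) if h != 0]
    return sum(errors) / len(errors)

human_effects = [0.20, -0.10, 0.05, 0.40]      # hypothetical human estimates
simulated_effects = [0.60, -0.02, 0.15, 0.90]  # hypothetical LLM estimates

agreement = direction_agreement(simulated_effects, human_effects)
magnitude_error = mean_relative_magnitude_error(simulated_effects, human_effects)
print(f"direction agreement: {agreement:.2f}")
print(f"mean relative magnitude error: {magnitude_error:.2f}")
```

In this toy case every sign matches, yet the simulated sizes are off by more than the human effects themselves. That is the pattern in which a directional reading is safe and a point estimate is not.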
The survey-realism research sharpens this. Pathological skew and over-positivity in LLM survey responses turn out to be *measurement artifacts* of how the answer is elicited, not intrinsic model failures: eliciting free text and mapping it to scales via embeddings recovers ~90% of human test-retest reliability with realistic distributions (Why do LLMs give unrealistic survey responses?). So aggregate realism is real and recoverable. But realism at the distribution level says nothing about whether any single simulated respondent is a faithful stand-in for a person, which is where fine-grained reliability quietly breaks.
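The elicit-text-then-map-to-scale idea can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the cited method's implementation: `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, the anchor statements are invented, and the softmax `temperature` is an arbitrary choice.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real sentence-embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical anchor statements, one per point on a 5-point scale.
ANCHORS = {
    1: "i would definitely not buy this",
    2: "i would probably not buy this",
    3: "i am unsure whether i would buy this",
    4: "i would probably buy this",
    5: "i would definitely buy this",
}

def rating_distribution(response: str, temperature: float = 0.1) -> dict:
    """Turn a free-text answer into a probability distribution over scale points."""
    sims = {k: cosine(toy_embed(response), toy_embed(a)) for k, a in ANCHORS.items()}
    # Softmax over similarities; the temperature is an assumed tuning knob.
    weights = {k: math.exp(s / temperature) for k, s in sims.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

dist = rating_distribution("i would probably buy this for my kitchen")
expected_rating = sum(k * p for k, p in dist.items())
print(dist, round(expected_rating, 2))
```

Each response yields a distribution over scale points rather than one forced integer, so pooling many responses can give a realistically spread histogram. The bag-of-words stand-in barely separates "probably" from "probably not", which is why a real embedding model is assumed in practice.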
Why the two layers come apart is worth seeing. A consistent output is not a reliable one: zero-temperature determinism just replays one draw from the model's distribution, and repeated-sampling tests show that consistency ≠ reliability (Does setting temperature to zero actually make LLM outputs reliable?). And the thing being measured isn't even one thing: annotation responses decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable only by their consistency across conditions (Do all annotation responses measure the same underlying thing?). Aggregating washes these together into a plausible mean while the individual signal stays noisy.
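The gap between consistency and reliability shows up in a small simulation. The sketch below assumes a made-up answer distribution for a single survey item and compares greedy (zero-temperature) decoding with repeated sampling. It is a simple illustration, not the McDonald's omega analysis described in the sources.

```python
import random
from collections import Counter

# Hypothetical answer distribution for one survey item (1-5 scale); invented numbers.
ANSWER_PROBS = {1: 0.05, 2: 0.15, 3: 0.28, 4: 0.30, 5: 0.22}

def greedy_answer(probs):
    # Zero temperature: always return the single most likely answer.
    return max(probs, key=probs.get)

def sampled_answer(probs, rng):
    # Temperature 1: draw one answer in proportion to its probability.
    answers = list(probs)
    return rng.choices(answers, weights=[probs[a] for a in answers], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
greedy_runs = [greedy_answer(ANSWER_PROBS) for _ in range(100)]
sampled_runs = [sampled_answer(ANSWER_PROBS, rng) for _ in range(100)]

greedy_counts = Counter(greedy_runs)
sampled_counts = Counter(sampled_runs)
# Share of sampled runs that land on the most frequent answer.
modal_share = sampled_counts.most_common(1)[0][1] / len(sampled_runs)

print("greedy:", dict(greedy_counts))
print("sampled:", dict(sampled_counts), "modal share:", modal_share)
```

Greedy decoding returns the same answer in all 100 runs, which looks perfectly consistent. The sampled runs show that this answer holds only a minority of the probability mass in this toy distribution, so the stable output hides real uncertainty.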
There's also a structural reason the aggregate can lie *by design*. A single model trained on pooled preferences literally cannot represent a 51-49 disagreement: it must either always side with the majority, leaving the 49% unhappy every time, or leave everyone unhappy half the time (Can aggregate reward models satisfy genuinely disagreeing users?). The aggregate looks coherent precisely because it has erased the fine-grained variance that would reveal the disagreement. Relatedly, simulations look most competent exactly where they cheat: LLMs handle social scenarios well when one model puppets every party, then fail once agents must hold private information (Why do LLMs fail when simulating agents with private information?). Apparent population-level fluency rests on grounding work skipped at the individual level.
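The 51-49 argument above can be worked through as arithmetic. The sketch assumes a single policy that picks option A with some fixed probability and two user groups with opposite preferences. The function name and setup are illustrative, not taken from the cited work.

```python
def satisfaction(p_choose_a: float, share_prefers_a: float = 0.51):
    """Return (group A, group B, overall) satisfaction rates.

    One shared policy outputs option A with probability p_choose_a.
    Group A (share_prefers_a of users) is satisfied only by A,
    group B (the rest) only by the alternative.
    """
    sat_a = p_choose_a
    sat_b = 1.0 - p_choose_a
    overall = share_prefers_a * sat_a + (1.0 - share_prefers_a) * sat_b
    return sat_a, sat_b, overall

always_majority = satisfaction(1.0)  # always side with the 51%
coin_flip = satisfaction(0.5)        # split the difference

print("always majority:", always_majority)
print("coin flip:", coin_flip)
```

Always siding with the majority gives the highest overall number while the minority group's satisfaction drops to zero. The coin flip treats both groups equally but satisfies each only half the time. No single value of `p_choose_a` satisfies both groups at once, which is the representational limit the paragraph describes.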
The takeaway a curious reader might not expect: aggregate realism and fine-grained unreliability don't just coexist; the first can *create* the illusion that the second has been solved. Use these simulations the way the causal-models paper recommends: to read the direction of an effect, generate hypotheses, and rank options. Treat any specific magnitude, individual respondent, or minority signal as something you still have to verify against humans. Crowdsourced preference at scale works for the same reason: it's the diverse aggregate that's trustworthy, validated against expert raters, not any one vote (Can crowdsourced votes reliably rank language models?).
Sources (7 notes)
LLMs guided by structural causal models can propose and test causal hypotheses across negotiation, bail, interview, and auction scenarios. Simulations reveal effect directions reliably but not magnitudes, making them useful for directional social science.
Semantic Similarity Rating—prompting for text then mapping to scales via embeddings—achieves 90% of human test-retest reliability with realistic distributions. Pathological skew and over-positivity disappear when output channels change, proving these are measurement artifacts, not intrinsic failures.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters—validating human preference as a scalable evaluation signal.