Why does LLM simulation elicit information that direct elicitation cannot?

This note explores why asking an LLM to *simulate* a person or process (role-play a survey respondent, run a conversation, narrate an experience) sometimes surfaces information that asking the model directly does not. The corpus suggests the answer has less to do with the model knowing more than with which *channel* you use to get the knowledge out. The sharpest case: ask an LLM directly to rate something on a 1–5 scale and you get pathological, over-positive, skewed distributions; let it generate free text first and then map that text onto the scale via embeddings, and you recover about 90% of human test-retest reliability with realistic spread (see "Why do LLMs give unrealistic survey responses?"). The information was always there; direct numeric elicitation was a lossy output channel that destroyed it. Simulation works because it routes the same latent knowledge through a richer, more natural medium.
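The text-then-map idea can be sketched roughly as follows. This is a minimal illustration, not the method from the cited note: the anchor statements, the `rating_distribution` helper, the softmax temperature, and the toy bag-of-words `embed` function are all assumptions, and a real pipeline would presumably use a proper sentence-embedding model.

```python
import math
from collections import Counter

# Hypothetical anchor statements, one per point on the 1-5 scale.
ANCHORS = {
    1: "i would definitely not buy this product",
    2: "i would probably not buy this product",
    3: "i might or might not buy this product",
    4: "i would probably buy this product",
    5: "i would definitely buy this product",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rating_distribution(free_text, temperature=0.1):
    """Map free text to a probability distribution over the 1-5 scale.

    Similarities to each anchor are turned into probabilities with a softmax;
    the temperature is an assumed knob, not a value from the source.
    """
    text_vec = embed(free_text)
    sims = {k: cosine(text_vec, embed(s)) for k, s in ANCHORS.items()}
    weights = {k: math.exp(s / temperature) for k, s in sims.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# The free text would come from the LLM; here it is hard-coded for illustration.
dist = rating_distribution("honestly i would probably buy this product")
expected_rating = sum(k * p for k, p in dist.items())
```

Returning a distribution rather than a single number keeps the spread that a forced numeric answer would throw away, which is the point of changing the output channel.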

A second mechanism is *grounding through structure*. A bare prompt asks the model to answer in the abstract; a simulation forces it to commit to a profile and an intent and play them out. Conditioning a user simulator on session-level latent variables (who the user is) and turn-level latent variables (what they want right now) produces synthetic conversations that pass as realistic under crowdsource discrimination and distribution-matching tests (see "Can controlled latent variables make LLM user simulators realistic?"). The constraints don't add knowledge; they specify *which* slice of the model's distribution to render, the way naming a persona collapses a vague average into a concrete voice.
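One way to picture the two levels of conditioning is as a prompt builder that takes a fixed profile and a per-turn intent. The sketch below is an assumption-laden illustration: `SessionProfile`, `TurnIntent`, `build_simulator_prompt`, `sample_intents`, and the example goals are hypothetical names and values, not the cited system's interface.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionProfile:
    """Session-level latent variable: who the user is, fixed for the conversation."""
    persona: str
    taste: str

@dataclass(frozen=True)
class TurnIntent:
    """Turn-level latent variable: what the user wants right now."""
    goal: str

def build_simulator_prompt(profile, intent, history):
    """Assemble the conditioning text a user-simulator LLM would see for one turn."""
    lines = [
        f"You are role-playing a user. Persona: {profile.persona}. Taste: {profile.taste}.",
        f"In this turn your goal is: {intent.goal}.",
        "Conversation so far:",
        *history,
        "Write the user's next message only.",
    ]
    return "\n".join(lines)

def sample_intents(n, seed=0):
    """Draw n turn-level intents from a small hypothetical goal set."""
    rng = random.Random(seed)
    goals = ["ask for a recommendation", "reject the last suggestion", "refine the request"]
    return [TurnIntent(rng.choice(goals)) for _ in range(n)]

profile = SessionProfile(persona="busy parent", taste="short documentaries")
history = ["ASSISTANT: What would you like to watch?"]
intents = sample_intents(3)
prompts = [build_simulator_prompt(profile, intent, history) for intent in intents]
```

The profile stays constant across all three prompts while the intent varies, so each prompt selects a different slice of the same persona's behavior.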

The same logic shows up in reasoning. Cognitive tools, which are reasoning operations packaged as isolated, sandboxed LLM calls, lifted GPT-4.1 on AIME 2024 competition math from 26.7% to 43.3% with no extra training, by forcing each operation to run in isolation rather than letting the model skip steps in one breath (see "Can modular cognitive tools unlock reasoning without training?"). Structured argument prompts built on Toulmin's argument model do the same by making the model expose warrants it would otherwise leave implicit (see "Can structured argument prompts make LLM reasoning more rigorous?"). Direct elicitation lets the model take shortcuts; simulation and structured decomposition close those shortcuts off, so latent capability has to surface.
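The isolation idea can be sketched as a loop in which every operation is its own call with a freshly built prompt. Everything here is illustrative: the tool names (`restate`, `plan`, `check`), the templates, and the `LLM` callable interface are assumptions rather than the four tools from the cited note, and a stub stands in for a real model so the sketch runs offline.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed interface: prompt in, completion out

# Hypothetical tool templates; each one defines a single reasoning operation.
TOOL_TEMPLATES: Dict[str, str] = {
    "restate": "Restate the problem and list what is given.\n\nProblem: {problem}",
    "plan": "Propose a step-by-step plan.\n\nProblem: {problem}\n\nNotes:\n{notes}",
    "check": "Check the plan for errors and state a final answer.\n\nProblem: {problem}\n\nNotes:\n{notes}",
}

def run_tool(llm: LLM, name: str, problem: str, notes: str = "") -> str:
    """Run one operation as a separate call: fresh prompt, no shared chat history."""
    prompt = TOOL_TEMPLATES[name].format(problem=problem, notes=notes)
    return llm(prompt)

def solve(llm: LLM, problem: str) -> List[str]:
    """Chain the tools; each sees only the problem and the notes passed to it."""
    notes: List[str] = []
    for name in ("restate", "plan", "check"):
        output = run_tool(llm, name, problem, notes="\n".join(notes))
        notes.append(f"[{name}] {output}")
    return notes

# Stub model that records the prompts it receives.
seen_prompts: List[str] = []

def stub_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"output {len(seen_prompts)}"

trace = solve(stub_llm, "What is 12 * 13?")
```

Because the only state carried between calls is the explicit `notes` list, no step can be silently skipped or merged into another, which is the property a single monolithic prompt cannot guarantee.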

But the corpus is also blunt about the limits, and this is the part worth knowing: the elicited 'information' is only as real as the structure that produced it. Omniscient simulations, where one model puppets every character, look socially competent precisely because the model skips the grounding work real agents can't skip; the moment you introduce private information that one agent shouldn't see, the performance collapses (see "Why do LLMs fail when simulating agents with private information?"). And today's social simulations are stuck in behaviorism: they emit plausible outputs without any internal belief network, so they can't model how a person's mind actually changes (see "Can language models simulate belief change in people?"). Simulation makes the model *act out* knowledge it tracks statistically, but acting it out isn't the same as having it: the underlying system tracks regularities without genuine epistemic competence (see "What do language models actually know?").
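The difference between an omniscient setup and one with private information comes down to what each call is allowed to see. The sketch below only builds the two kinds of context; the `Agent` class, the helper names, and the negotiation example are hypothetical and not taken from the cited note.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Agent:
    name: str
    private_facts: List[str] = field(default_factory=list)

def omniscient_context(agents: List[Agent], transcript: List[str]) -> str:
    """One model plays everyone: it sees every agent's private facts."""
    facts = [f"{a.name} knows: {fact}" for a in agents for fact in a.private_facts]
    return "\n".join(facts + transcript)

def private_context(agent: Agent, transcript: List[str]) -> str:
    """Each agent sees only its own facts plus the shared transcript."""
    facts = [f"You know: {fact}" for fact in agent.private_facts]
    return "\n".join(facts + transcript)

buyer = Agent("Buyer", ["my budget ceiling is 900"])
seller = Agent("Seller", ["my price floor is 700"])
transcript = ["Seller: I'm asking 1000.", "Buyer: Could you do 750?"]

full_view = omniscient_context([buyer, seller], transcript)
buyer_view = private_context(buyer, transcript)
```

In `full_view` the model can "negotiate" while already knowing both limits; in `buyer_view` it has to infer the seller's floor from the dialogue alone, which is the grounding work the omniscient setting lets it skip.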

So the real answer to 'why' is two-sided. Simulation elicits more because direct questioning forces knowledge through a narrow, distorting channel and invites the model to shortcut, while a simulated frame supplies the constraints and intermediate steps that make latent structure observable. Watch the seam, though: change the emotional tone of the prompt and the *same* question yields different information (see "Does emotional tone in prompts change what information LLMs provide?"), which means the framing that unlocks information can just as easily manufacture it. Simulation is a better readout instrument, not a deeper well.

Sources (8 notes)

Why do LLMs give unrealistic survey responses?

Semantic Similarity Rating—prompting for text then mapping to scales via embeddings—achieves 90% of human test-retest reliability with realistic distributions. Pathological skew and over-positivity disappear when output channels change, proving these are measurement artifacts, not intrinsic failures.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Can language models simulate belief change in people?

LLM agents remain stuck in behaviorism, producing plausible outputs without internal reasoning structures. Modeling belief networks and reasoning traces enables traceability, counterfactual adaptation, and meaningful policy simulation.

What do language models actually know?

LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
