Can AI personas reliably replicate human experiment results?
Exploring whether LLM-based persona simulations accurately reproduce experimental findings from published psychology and marketing research, and what factors determine when they succeed or fail.
The Viewpoints AI study systematically replicated 45 experiments from 14 Journal of Marketing articles (2023-2024), creating unique AI persona instances matching original sample sizes and demographics. Each persona received the exact stimuli and measures from the original study.
Results by evidence strength:
- Main effects overall: 76% replicated (84/111)
- Including interaction effects: 68% (90/133)
- Strong original evidence (low p-values): high replication rate
- Marginal effects (higher p-values): replication success declines, with both false positives and false negatives
- Non-significant original effects (p > 0.5): mixed; the simulation sometimes correctly identifies the absence of an effect and sometimes introduces a spurious one
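The headline rates can be checked directly from the reported counts. The sketch below recomputes them and attaches an approximate 95% Wilson score interval to each; the interval is an addition for illustration, not something the note reports. It also derives an implied interaction-effect rate under the assumption that the 133 findings consist of the 111 main effects plus 22 interaction effects, which the note suggests but does not state outright.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts reported in the note.
MAIN_REPLICATED, MAIN_TOTAL = 84, 111
ALL_REPLICATED, ALL_TOTAL = 90, 133

# Assumption: the 133 findings are the 111 main effects plus interaction effects,
# so the interaction-only counts are the differences.
INTERACTION_REPLICATED = ALL_REPLICATED - MAIN_REPLICATED
INTERACTION_TOTAL = ALL_TOTAL - MAIN_TOTAL

rows = [
    ("main effects", MAIN_REPLICATED, MAIN_TOTAL),
    ("all findings", ALL_REPLICATED, ALL_TOTAL),
    ("interaction effects (derived)", INTERACTION_REPLICATED, INTERACTION_TOTAL),
]
for label, k, n in rows:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} (interval {low:.1%} to {high:.1%})")
```

If the assumption holds, only 6 of 22 interaction effects replicated, which would mean the drop from 76% to 68% is driven by a much lower success rate on interactions than on main effects.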
The key finding is that replication success tracks the original p-value: LLM persona simulations function as a noisy amplifier of existing evidence. Strong effects register clearly; weak effects are lost in the noise floor. Persona simulation is therefore useful for confirming robust effects but unreliable for detecting subtle ones, which are precisely the effects that matter most for advancing theory.
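To make the outcome categories concrete, here is a minimal sketch of how a simulated finding might be labeled against the original human result. The decision rule (significance at alpha = 0.05 plus matching direction) is an assumption for illustration; the note does not specify the study's replication criterion. The function name `classify_outcome` and the example p-values are hypothetical.

```python
def classify_outcome(original_p: float, simulated_p: float,
                     same_direction: bool, alpha: float = 0.05) -> str:
    """Label a simulated finding relative to the original human result.

    Assumed rule: an effect counts as significant when p < alpha.
    "False positive" and "false negative" are relative to the original
    study, not to an independent ground truth.
    """
    original_sig = original_p < alpha
    simulated_sig = simulated_p < alpha
    if original_sig and simulated_sig:
        return "replicated" if same_direction else "reversed"
    if original_sig:
        return "false negative"
    if simulated_sig:
        return "false positive"
    return "absence confirmed"


# Hypothetical p-values, for illustration only (not taken from the study).
examples = [
    ("strong original effect", 0.001, 0.004, True),
    ("marginal original effect", 0.04, 0.30, True),
    ("null original effect", 0.60, 0.02, True),
    ("null original effect", 0.60, 0.45, True),
]
for label, p_orig, p_sim, same_dir in examples:
    print(f"{label}: {classify_outcome(p_orig, p_sim, same_dir)}")
```

Under this rule, a marginal original effect sits close to the threshold, so a small amount of simulation noise is enough to flip its label, which is one way to read the pattern described above.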
The efficiency argument is compelling regardless: studies that took weeks can be run in minutes, potentially during a single meeting. For applied contexts — pretesting health PSAs, ad variants, social media posts — 76% main effect replication with instant turnaround may be sufficient.
However, the 24% failure rate on main effects (roughly 1 in 4 significant findings producing no difference with AI personas) leaves the question of ground truth unresolved. Are the human results or the AI results more representative? Human-subjects studies carry their own biases (gender, race, age, cultural context), and LLMs are trained on data containing those same biases, so neither can claim to be accurate by definition.
Inquiring lines that use this note as a source 54
This note is a source for these synthesized inquiries.
- Do individual persona simulations work?
- Why does belief-specific tailoring work better than demographic personalization?
- Can agent-based simulators replace real-user A/B testing for studying recommendation system harms?
- How do LLM user simulators fail to represent authentic user behavior distributions?
- Can structured empathy measurement frameworks predict persona effectiveness?
- Can proxy evaluation of ideas accurately predict their quality without implementation?
- Do LLMs genuinely internalize human psychological structure or match surface patterns?
- How do LLMs identify which personality items matter most for trait inference?
- Can persona-attention mechanisms explain recommendations better than external surrogate models?
- How do LLM personas compare to demographic targeting?
- What makes personas in multi-agent systems actually contribute meaningful domain depth?
- What distribution patterns appear across different theory-of-mind datasets?
- Why does mimicking human behavior differ from simulating human cognition?
- What role does authentic self-expression play in building accurate personality models?
- Does adding survey data to interviews improve agent accuracy further?
- Why do short interviews outperform demographic labels for persona simulation?
- Can persona-based approaches capture genuine disagreement in expert annotations?
- How do LLMs default to surface-level strategies instead of genuine mental simulation?
- Why does dynamic persona identification outperform fixed personas in prompting?
- How should CASA theory be updated for modern personalized agents?
- How much does omniscient evaluation overstate real-world simulation fidelity?
- Does the Assistant Axis gravitational pull prevent true individual-level persona personalization?
- How does support coverage relate to systematic biases in persona simulation?
- Why do individual persona simulations succeed when population-level representation fails?
- Can demographic personas predict behavior without rich narrative grounding?
- Do stated character beliefs predict decisions better when extracted from text?
- Can persona simulations reliably predict behavior across different scenarios?
- Does the replication crisis in psychology predict similar failures in machine behavior research?
- Can treating simulated users as trainable agents reduce persona consistency drift?
- Can similar profiles amplify systematic biases in persona simulation at scale?
- Does persona-level grouping systematically trigger confidence-misdirection failures in practice?
- Why do current evaluation metrics fail to catch reasoning failures in persona agents?
- Can models converge on similar experience descriptions across different architectures?
- Can Big Five personality models improve synthetic data quality at scale?
- Can advertising mechanisms designed for humans work on agents?
- How much does interview richness matter compared to model capability for persona accuracy?
- Does alignment training intensity push LLM personas from pretense toward realization?
- How do emotional and social simulations enable better hypothetical reasoning?
- Why does persona-level information often fail to predict individual preferences?
- When should persona attention weight activate versus stay dormant during scoring?
- Can role-aligned AI systems replicate an expert's sense of audience and moment?
- Why do marginal effects fail to replicate in AI persona simulations?
- Do LLMs predict social norms more accurately than individual behavior?
- What systematic biases emerge when scaling persona simulation to population level?
- How does AI persona fidelity compare to interview-based generative agents?
- Can a perfect behavioral simulation constitute genuine understanding or experience?
- Why do LLM persona simulations replicate main effects but fail on marginal effects?
- Does model uncertainty overwhelm persona-specific signal in conditioned predictions?
- How much does sparse persona information limit the power of conditioning?
- Can experimental outcomes be reliably distilled into reusable insights?
- Does richer input to LLM personas improve their fidelity to human responses?
- Do realistic LLM behaviors require simulating human thought or just behavior?
- Can persona prompts reliably transfer across different question domains?
- How should persona prompts be used if not for accuracy?
Related concepts in this collection 4
- Can AI agents learn people better from interviews than surveys?
  Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.
  85% individual vs 76% experimental; different simulation tasks, different fidelity levels
- How do we generate realistic personas at population scale?
  Current LLM-based persona generation relies on ad hoc methods that fail to capture real-world population distributions. The challenge is reconstructing the joint correlations between demographic, psychographic, and behavioral attributes from fragmented data.
  population-level bias may explain the 24% failure rate
- Can AI systems learn social norms without embodied experience?
  Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?
  convergent evidence: social norm prediction at 100th percentile and 76% experimental replication both demonstrate LLMs approximating human behavioral data from text alone, but the experimental replication shows the ceiling effect: strong effects replicate while marginal effects are noise, suggesting statistical learning captures cultural consensus better than individual variation
- Does conditioning LLMs on personal profiles improve prediction?
  Persona induction (feeding LLMs participant-specific information) is widely used to make models simulate individuals more accurately. But does it actually work at the individual level where it matters most?
  extends: same fault line; main effects survive while individual/marginal effects fail
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings
- Persona Generators: Generating Diverse Synthetic Personas at Scale
- A meta-analysis of the persuasive power of large language models
- A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
- LLM Generated Persona is a Promise with a Catch
- Exploring the Role of Prior Beliefs for Argument Persuasion
- When Large Language Models are More Persuasive Than Incentivized Humans, and Why
- The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
Original note title
LLM persona simulations replicate 76 percent of published experimental main effects but accuracy tracks original evidence strength — marginal effects are unreliable