Does the replication crisis in psychology predict similar failures in machine behavior research?
This explores whether the same forces that wrecked reproducibility in psychology — weak effects dressed up as strong ones, single-sample overreach, and measuring the wrong thing — are already showing up in research on how AI systems behave.
The corpus suggests that psychology's replication crisis is more than an analogy for trouble in machine-behavior research: in at least one case the crisis is literally inherited. When AI personas were used to re-run published human experiments, they reproduced 76% of main effects (84 of 111), and the effects they reproduced tended to be the ones with strong original p-values; marginal effects produced both false positives and false negatives (Can AI personas reliably replicate human experiment results?). That is striking: the AI does not escape the replication crisis, it tracks it. The fragile findings most at risk of failing to replicate in human subjects are the ones that fail to replicate in silico. So the question partly answers itself: machine-behavior research that leans on simulated humans imports psychology's weak-effect problem wholesale.
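To put the 76% figure in perspective, here is a minimal sketch that recomputes it from the 84-of-111 count reported in the source note and attaches a 95% Wilson score interval. The interval is an illustration added here, not something the original study is known to report, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Counts taken from the source note: 84 of 111 main effects reproduced.
replicated, total = 84, 111
rate = replicated / total
low, high = wilson_interval(replicated, total)
print(f"replication rate: {rate:.1%}, 95% Wilson interval: {low:.1%} to {high:.1%}")
```

The interval is wide enough to be worth keeping in mind: a single headline percentage from 111 effects is itself an estimate, which is the same caution the replication crisis taught about point estimates in human studies.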
But the deeper parallel is methodological, and here the corpus points to failure modes psychology would recognize instantly. The classic crisis sin is generalizing from a single study to the world. Longitudinal chatbot work makes the same warning concrete: the social pull of a chatbot relationship declines once novelty wears off, so single-session findings cannot be reliably extrapolated to medium- or long-term behavior (Do chatbot relationships lose their appeal as novelty wears off?). Swap "single-session" for "one undergrad sample" and you have the external-validity complaint that fueled the original crisis.
Then there's the construct-validity problem: measuring something other than what you claim. Psychology's crisis was partly a crisis of measurement, and machine-behavior research has its own version. Benchmark gains in RLVR can reflect genuine reasoning activation or memorization of contaminated test data, and the two coexist at different measurement levels (Can genuine reasoning activation coexist with contaminated benchmarks?). Imitation models fool human evaluators by copying ChatGPT's confident style while closing no actual capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?). And even our vocabulary mismeasures: calling LLM errors "hallucinations" misdirects fixes toward perception when the real mechanism is undifferentiated fabrication (Should we call LLM errors hallucinations or fabrications?). Each is a case of the field measuring a surface signal and reporting it as the underlying construct, the same kind of move that inflated psychology's effect sizes.
Where it gets worse than psychology is that the measuring instrument is itself unstable. In human research the ruler at least holds still; in machine-behavior work the evaluator is often another model that drifts. On complex tasks, agentic evaluation cut "judge shift" to 0.27%, against 31% for a plain LLM-as-a-Judge, but the fix introduced its own cascading errors through a memory module (Can agents evaluate AI outputs more reliably than language models?). A field whose measurement tool can shift by nearly a third has a reliability problem on a scale psychology rarely had to confront.
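The source note does not define how judge shift is computed, so the sketch below rests on a loudly stated assumption: that it can be read as the share of verdicts on which an automated judge differs from a reference set of verdicts. The function name `disagreement_rate` and the verdict lists are hypothetical toy data, not results from any study.

```python
def disagreement_rate(judge_verdicts: list[bool], reference_verdicts: list[bool]) -> float:
    """Fraction of items where the judge's verdict differs from the reference verdict.

    Assumption: this stands in for "judge shift"; the source gives no formula.
    """
    if not judge_verdicts or len(judge_verdicts) != len(reference_verdicts):
        raise ValueError("verdict lists must be non-empty and of equal length")
    mismatches = sum(j != r for j, r in zip(judge_verdicts, reference_verdicts))
    return mismatches / len(judge_verdicts)


# Hypothetical pass/fail verdicts on ten task requirements (illustrative only).
reference = [True, True, False, True, False, False, True, True, False, True]
judge = [True, False, False, True, True, False, True, False, False, True]

shift = disagreement_rate(judge, reference)
print(f"disagreement with reference verdicts: {shift:.0%}")
```

Under this reading, a shift near 31% would mean the judge and the reference disagree on roughly three items in ten, which is why the stability of the evaluator matters as much as the behavior being evaluated.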
The non-obvious takeaway: the crisis does not just predict similar failures, it predicts they will compound. Human-AI interaction research has to worry about the researcher's own cognition too, where map-territory confusion and confirmation bias reinforce each other into epistemic drift (Why do people trust AI outputs they shouldn't?). So the honest answer is yes, with a twist: machine-behavior research faces psychology's replication crisis, plus a moving measurement instrument, plus a researcher whose own judgment is shaped by the model. That is three error sources stacking rather than one.
Sources (7 notes)
Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.
Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
Rose-Frame identifies map-territory confusion, intuition-reason conflation, and confirmation-bias reinforcement as traps that multiply their distorting effects when they co-occur. Evidence from cross-linguistic overreliance and architectural transformer biases confirms the compounding mechanism operates universally.