Does the replication crisis in psychology predict similar failures in machine behavior research?

This explores whether the same forces that wrecked reproducibility in psychology — weak effects dressed up as strong ones, single-sample overreach, and measuring the wrong thing — are already showing up in research on how AI systems behave.

Psychology's replication crisis looks less like an analogy for machine-behavior research and more like a preview of it. The corpus suggests the connection is more than metaphor: in at least one case the crisis is literally inherited. When AI personas were used to re-run published human experiments, they reproduced 76% of main effects (84 of 111), and replication success was strongly correlated with the strength of the original p-value; marginal effects produced both false positives and false negatives (see "Can AI personas reliably replicate human experiment results?"). That's striking: the AI doesn't escape the replication crisis, it tracks it. The statistically fragile findings, the kind most likely to fail replication in human subjects, are the ones that fail to replicate in silico. So the question partly answers itself: machine-behavior research that leans on simulated humans imports psychology's weak-effect problem wholesale.
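The 76% figure and its link to p-value strength can be made concrete with a little arithmetic. The sketch below is illustrative only. It takes the 84-of-111 count from the source note and puts a Wilson score interval around it, then uses a textbook normal approximation to estimate how often a same-size replication would reach significance if the true effect equalled the original estimate. That second step is an assumption of this example, not the method of the cited study, and the function names `wilson_interval` and `replication_power` are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return centre - half_width, centre + half_width


def replication_power(original_p: float, alpha: float = 0.05) -> float:
    """Chance that a same-size replication is significant in the same direction.

    Assumption (illustrative): the true effect equals the original estimate,
    and test statistics are approximately normal.
    """
    nd = NormalDist()
    z_observed = nd.inv_cdf(1 - original_p / 2)  # z implied by a two-sided p
    z_critical = nd.inv_cdf(1 - alpha / 2)       # threshold for significance
    return 1 - nd.cdf(z_critical - z_observed)


if __name__ == "__main__":
    low, high = wilson_interval(84, 111)
    print(f"replication rate: {84 / 111:.1%}, 95% interval {low:.1%} to {high:.1%}")
    for p in (0.05, 0.01, 0.001):
        print(f"original p = {p}: replication power = {replication_power(p):.2f}")
```

Under these assumptions an original result sitting right at p = 0.05 has about a 50% chance of replicating even when the effect is real, which is one reason marginal findings are the ones that wobble, whether the subjects are people or personas.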

But the deeper parallel is methodological, and here the corpus points to failure modes psychology would recognize instantly. The classic crisis sin is generalizing from a single study to the world. Longitudinal chatbot work makes the same warning concrete: the social processes that drive a chatbot relationship decline once novelty wears off, so single-session findings cannot be reliably extrapolated to medium- or long-term behavior (see "Do chatbot relationships lose their appeal as novelty wears off?"). Swap "single-session" for "one undergrad sample" and you have the external-validity complaint that fueled the original crisis.

Then there's the construct-validity problem: measuring something other than what you claim. Psychology's crisis was partly a crisis of measurement, and machine-behavior research has its own version. Benchmark gains in RLVR can reflect genuine reasoning activation or memorization of contaminated test data, and the two can coexist at different measurement levels (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). Imitation models fool human evaluators by copying ChatGPT's confident style while failing to improve factuality or generalization (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). And even our vocabulary mismeasures: calling LLM errors "hallucinations" misdirects fixes toward perception when accurate and inaccurate outputs come from the same generation mechanism (see "Should we call LLM errors hallucinations or fabrications?"). Each is a case of the field measuring a surface signal and reporting it as the underlying construct, a close cousin of the measurement failures that inflated psychology's effect sizes.

Where it gets worse than psychology is that the measuring instrument is itself unstable. In human research the ruler at least holds still; in machine-behavior work the evaluator is often another model that drifts. On complex tasks, agentic evaluation cut "judge shift" to 0.27% against 31% for a plain LLM-as-judge, but the fix introduced its own cascading errors through a memory module (see "Can agents evaluate AI outputs more reliably than language models?"). A field whose default judge shifts by nearly a third has a reliability problem on a scale psychology's instruments rarely posed.

The non-obvious takeaway: the crisis doesn't just predict similar failures, it predicts they'll compound. Human-AI interaction research has to worry about the researcher's own cognition too, where map-territory confusion and confirmation bias reinforce each other into epistemic drift (see "Why do people trust AI outputs they shouldn't?"). So the honest answer is yes, with a twist: machine-behavior research faces psychology's replication crisis plus a moving measurement instrument plus a researcher prone to epistemic drift, three error sources stacking rather than one.

Sources 7 notes

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.

Do chatbot relationships lose their appeal as novelty wears off?

Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Should we call LLM errors hallucinations or fabrications?

LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Why do people trust AI outputs they shouldn't?

Rose-Frame identifies map-territory confusion, intuition-reason conflation, and confirmation-bias reinforcement as traps that multiply their distorting effects when they co-occur. Evidence from cross-linguistic overreliance and architectural transformer biases confirms the compounding mechanism operates universally.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research methodology auditor. This question remains open: Does the replication crisis in psychology predict similar failures in machine behavior research, or have newer models, evaluation methods, or field practices since 2024–2026 decoupled the two?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026. Key constraints identified:
• AI personas re-running published human experiments (Journal of Marketing) reproduced 76% of main effects (84 of 111), with replication success tracking original p-value strength; marginal effects produced both false positives and false negatives (~2024–25).
• Chatbot relationship effects decay predictably over repeated interaction; single-session findings cannot extrapolate to medium/long-term behavior (~2024).
• RLVR benchmark gains conflate genuine reasoning with memorized contaminated test data at different measurement layers (~2025); CoT mimics reasoning constraint rather than enabling true reasoning (~2025).
• LLM-as-judge evaluators show ~31% judge shift on complex tasks; agentic judges reduce this to 0.27% but introduce cascading memory-module errors (~2025–26).
• Measurement vocabulary ("hallucinations" vs. fabrication) misdirects fixes toward wrong constructs (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2408.16073 (2024-08): AI personas replicate human experiment fragility at 76%.
• arXiv:2305.15717 (2023-05): Imitation captures style, not factuality.
• arXiv:2510.14665 (2025-10): Illusion of understanding in LLM outputs.
• arXiv:2605.20025 (2026-05): Human-AI collaboration in autonomous research; self-reinforcing loops.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 76% replication, evaluator drift, and CoT-reasoning conflation, judge whether post-2026 training methods (mechanistic interpretability, synthetic data filtering, stricter dataset audits), orchestration (multi-agent verification, formal verification layers), or new benchmarks have relaxed these. Separate the durable question (does machine-behavior inherit psych's weak-effect problem?) from perishable limitations (does evaluator drift still matter if we've fixed the pipeline?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—e.g., papers claiming agentic evaluation or mechanistic interpretability have *solved* measurement stability, or showing machine behavior and human behavior diverge in ways that break the analogy entirely.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If agentic verification now holds evaluator drift below 2%, does the replication problem narrow to construct validity alone?" or "Does self-reinforcing autonomous research (2605.20025) create *new* error cascades that psychology never faced?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
