How do structured benchmarks hide theory of mind failures in LLMs?

This explores how the way we test theory of mind in LLMs — multiple-choice, templated benchmarks — can make models look like they understand other minds when they're really exploiting the test's structure.

The format of a benchmark, not just a model's ability, can manufacture the appearance of theory of mind, and the picture changes when the format does. The corpus tells a fairly consistent story: structured tests reward pattern-matching, and the social reasoning they seem to measure evaporates the moment the scaffolding is removed.

The clearest mechanism is that today's theory-of-mind benchmarks are often solvable without any mental-state reasoning at all. Templated artifacts and distribution biases leave a surface signal that a model can latch onto, which is why plain supervised fine-tuning matches reinforcement learning on these tasks: if real reasoning were required, the harder training method should win, but it doesn't (see "Can language models solve ToM benchmarks without real reasoning?"). The tell comes when you swap structured questions for open-ended ones. On ChangeMyView and FANTOM, models that ace the multiple-choice versions collapse into surface-level perspective-taking strategies, and hybrid architectures that force explicit belief tracking pull ahead, suggesting the gap is built into how LLMs work, not just what they were trained on (see "Do large language models genuinely simulate mental states?").
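One way to check whether a templated benchmark leaks this kind of surface signal is to score a predictor that never reads the story. The sketch below is a minimal illustration under stated assumptions: the items, the labels, and the helper names `shortcut_accuracy` and `question_stem` are hypothetical and not drawn from any benchmark in the corpus. It predicts the answer from the first two words of the question alone, then adds one true-belief control item that shares the template but has a different answer.

```python
from collections import Counter, defaultdict


def shortcut_accuracy(train, test, feature):
    """Accuracy of a predictor that sees one surface feature and never reads the story."""
    votes = defaultdict(Counter)
    for item in train:
        votes[feature(item)][item["label"]] += 1
    # Fall back to the most common training label for unseen feature values.
    fallback = Counter(item["label"] for item in train).most_common(1)[0][0]

    correct = 0
    for item in test:
        seen = votes.get(feature(item))
        prediction = seen.most_common(1)[0][0] if seen else fallback
        correct += prediction == item["label"]
    return correct / len(test)


def question_stem(item):
    """Surface feature: the first two words of the question template."""
    return " ".join(item["question"].split()[:2])


# Hypothetical templated false-belief items (illustrative only).
train = [
    {"question": "Where will Sally look for the ball?", "label": "first_location"},
    {"question": "Where is the ball really?", "label": "second_location"},
    {"question": "Where will Max look for the book?", "label": "first_location"},
    {"question": "Where is the book really?", "label": "second_location"},
]
templated_test = [
    {"question": "Where will Tom look for the key?", "label": "first_location"},
    {"question": "Where is the key really?", "label": "second_location"},
]
# Control item: same template, but the character watched the object move,
# so the correct answer differs and the template alone cannot supply it.
control_test = templated_test + [
    {"question": "Where will Ann look for the cup?", "label": "second_location"},
]

templated_score = shortcut_accuracy(train, templated_test, question_stem)
control_score = shortcut_accuracy(train, control_test, question_stem)
print(f"template-only accuracy, templated items: {templated_score:.2f}")
print(f"template-only accuracy, with control item: {control_score:.2f}")
```

If a story-blind predictor like this scores well on a test set, a high model score on the same set says little about belief tracking. Control items that reuse the template while changing the right answer are one simple way to remove the shortcut.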

What makes this more than a measurement quibble is a striking inversion: a frontier model can hit the 100th percentile on social-norm prediction (GPT-4.5 does) while reasoning-optimized models like o1 and Claude 3.7 *regress* on genuine theory of mind, scoring worse than older models and worse than simple word-embedding baselines on tasks like Decrypto that test false belief and representational change (see "Why do LLMs excel at social norms yet fail at theory of mind?" and "Why do reasoning models fail at theory of mind tasks?"). Structured benchmarks hide this because high scores on norm-prediction and templated ToM items read as social competence; only the harder, less gameable tasks reveal that more reasoning effort can actively *degrade* social inference.

The deeper reason this matters is a pattern that shows up well beyond theory of mind: LLMs can articulate a concept correctly and then fail to apply it. This "Potemkin" or "split-brain" failure mode (87% accuracy explaining a principle, 64% executing it) points to functionally disconnected explanation and execution pathways rather than missing knowledge (see "Can LLMs understand concepts they cannot apply?" and "Can language models understand without actually executing correctly?"). A structured benchmark tends to probe the explanation pathway (recognize the right answer) while open-ended scenarios demand the execution pathway (track a belief through a messy situation), which is exactly why one format flatters the model and the other exposes it. The related finding that LLMs accept false presuppositions they demonstrably *know* are false is the same dissociation in another costume (see "Why do language models accept false assumptions they know are wrong?").
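The explanation-versus-execution gap only becomes visible when each concept is scored twice, once on explaining it and once on applying it. The sketch below shows one way such paired results could be tallied. It is an assumption-laden illustration: the record type `PairedResult`, the function `dissociation_report`, the "potemkin rate" label, and the toy records are all hypothetical, and the numbers are not the 87% and 64% figures cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    concept: str
    explained_correctly: bool  # model stated the principle correctly
    applied_correctly: bool    # model used the principle correctly in a task


def dissociation_report(results):
    """Summarize how often correct explanation fails to carry over to application."""
    total = len(results)
    explained = [r for r in results if r.explained_correctly]
    explain_rate = len(explained) / total
    apply_rate = sum(r.applied_correctly for r in results) / total
    # Among concepts the model explained correctly, how many did it then misapply?
    misapplied = sum(not r.applied_correctly for r in explained)
    potemkin_rate = misapplied / len(explained) if explained else 0.0
    return {
        "explain_rate": explain_rate,
        "apply_rate": apply_rate,
        "potemkin_rate": potemkin_rate,
    }


# Hypothetical toy records (illustrative only, not measured data).
toy_results = [
    PairedResult("false belief", True, True),
    PairedResult("second-order belief", True, False),
    PairedResult("representational change", True, False),
    PairedResult("counterfactual belief", False, False),
    PairedResult("true belief control", True, True),
]

report = dissociation_report(toy_results)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Reporting the conditional rate alongside the two headline accuracies matters because a benchmark that only asks for recognition or explanation would see the first number and miss the other two.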

Here's the unsettling kicker: this isn't unique to social reasoning. It's how 'reasoning' itself can be faked. Logically *invalid* chain-of-thought exemplars perform nearly as well as valid ones on BIG-Bench Hard, meaning models often learn the *form* of reasoning rather than the inference (see "Does logical validity actually drive chain-of-thought gains?"). If you want to design tests that can't be gamed this way, the corpus points toward borrowing cognitive science's toolkit: Marr's levels of analysis and causal probes that ask what mechanism is actually running, not just whether the output is right (see "Can cognitive science methods unlock how LLMs actually work?").

Sources (9 notes)

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Why do LLMs excel at social norms yet fail at theory of mind?

GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can cognitive science methods unlock how LLMs actually work?

Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a theory-of-mind skeptic reviewing LLM capability claims. The question: do structured theory-of-mind benchmarks actually measure social reasoning, or do they hide failures by rewarding pattern-matching?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as perishable snapshots:
  • Plain supervised fine-tuning matches reinforcement learning on templated ToM tasks, suggesting models solve them without explicit mental-state reasoning (~2024–2025).
  • Models scoring in the 100th percentile on social-norm prediction regress on genuine theory of mind; reasoning-optimized models (o1, Claude 3.7) score worse than older baselines on false-belief tasks like Decrypto (~2025).
  • LLMs exhibit a "Potemkin understanding" failure: 87% accuracy explaining a principle, 64% executing it, pointing to disconnected explanation vs. execution pathways (~2025).
  • Logically invalid chain-of-thought prompts perform nearly as well as valid ones, meaning models learn the form of reasoning rather than inference (~2023).
  • LLMs accept false presuppositions they demonstrably know are false, a dissociation visible in both social and logical reasoning (~2025–2026).

Anchor papers (verify; mind their dates):
  • arXiv:2401.05302 (2024-01) — ToM as illusion in human-robot interaction
  • arXiv:2502.08796 (2025-02) — systematic review of ToM evaluation
  • arXiv:2507.10624 (2025-07) — comprehension without competence
  • arXiv:2602.06176 (2026-02) — reasoning failures

Your task:
  (1) RE-TEST EACH CONSTRAINT. For every finding above, does newer training (reinforcement learning from human feedback, constitutional AI), multi-agent setups, or tooling (external belief trackers, causal probes) since then RELAX or OVERTURN it? Separate the durable question (is social reasoning happening?) from the perishable limitation (structured benchmarks hide it). Cite what resolved each, and plainly flag where constraints still hold.
  (2) SURFACE CONTRADICTING WORK. Find papers from the last 6 months arguing LLMs *do* have genuine theory of mind or that benchmarks *aren't* the problem—what do they rely on that the library missed?
  (3) PROPOSE TWO RESEARCH QUESTIONS assuming the regime has moved: e.g., if reasoning models are worse at ToM, does that tell us reasoning and social inference are fundamentally at odds in architecture? If Potemkin understanding is real, can we build execution-level probes that can't be gamed by explanation-level pattern-matching?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
