Why does reasoning effort fail to improve theory of mind performance?
This explores why cranking up a model's reasoning effort — more thinking tokens, RL on reasoning, longer chains — doesn't help (and may actively hurt) its ability to track what other minds believe, and what that reveals about how 'reasoning' and 'social cognition' differ.
Pouring more reasoning effort into a model doesn't make it better at reading minds, and the corpus points to a surprisingly clean answer: theory of mind isn't the kind of problem that extra reasoning solves. The most direct evidence is almost paradoxical. Advanced reasoning models like Claude 3.7 Sonnet and o1 actually score *worse* than older, less optimized models on theory-of-mind benchmarks like Decrypto, sometimes worse than humans and even worse than simple word-embedding baselines (see: Why do advanced reasoning models fail at understanding minds?; Why do reasoning models fail at theory of mind tasks?). Optimizing for formal reasoning doesn't just fail to help social reasoning; it seems to interfere with it.
The leading explanation is architectural, not a matter of effort. Formal reasoning is sequential derivation: chaining one step to the next toward an answer. Social reasoning instead demands holding *multiple competing models of the world in mind at once* (what I believe, what you believe I believe, what you falsely believe). Reasoning models given ToM tasks produce longer traces that don't help and don't generalize, while a method called ThoughtTracing succeeds with *shorter* Bayesian hypothesis tracking, because it maintains several belief states in parallel rather than grinding down a single chain (see: Why do reasoning models struggle with theory of mind tasks?). A related finding shows LLMs default to surface-level strategies instead of genuine mental simulation, and that hybrid architectures forcing explicit belief tracking beat LLMs alone, suggesting the gap is built into the architecture and is not fixable by more training of the same kind (see: Do large language models genuinely simulate mental states?).
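To make "several belief states in parallel" concrete, here is a minimal sketch of weighted hypothesis tracking on a Sally-Anne style false-belief scenario. It is a toy illustration of the general idea, not the ThoughtTracing method itself: the hypothesis names, the `update` helper, and every likelihood value are assumptions invented for this example.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("all hypotheses have zero weight")
    return {h: w / total for h, w in weights.items()}


def update(beliefs: Dict[str, float], likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: posterior is proportional to prior times likelihood.

    Every hypothesis stays alive and is reweighted; nothing is collapsed
    into a single chain of deductions.
    """
    return normalize({h: p * likelihoods.get(h, 0.0) for h, p in beliefs.items()})


# Competing hypotheses about where Sally THINKS the marble is
# (in reality it has been moved to the box).
prior = {"sally_thinks_basket": 0.5, "sally_thinks_box": 0.5}

# Hypothetical likelihoods of each observation under each hypothesis.
# Observation 1: Sally was out of the room when the marble was moved.
absent_during_move = {"sally_thinks_basket": 0.9, "sally_thinks_box": 0.1}
# Observation 2: on returning, Sally walks toward the basket.
walks_to_basket = {"sally_thinks_basket": 0.8, "sally_thinks_box": 0.2}

posterior = update(update(prior, absent_during_move), walks_to_basket)
# The false-belief hypothesis now dominates (about 0.97), while the
# alternative is kept with a small weight instead of being discarded.
```

The design point is that the tracker never commits to one story early: each observation reweights all candidate mental states at once, which is the contrast with step-by-step derivation drawn above.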
This connects to a broader pattern the corpus has been documenting: more thinking is not monotonically better thinking. Reasoning accuracy peaks and then *declines* past a critical token threshold, with models 'overthinking' easy problems and underthinking hard ones (see: Does more thinking time always improve reasoning accuracy?), and optimal chain-of-thought length follows an inverted-U where the most capable models actually prefer shorter chains (see: Why does chain of thought accuracy eventually decline with length?). So 'reasoning effort fails to help ToM' is partly a special, severe case of a general truth, but ToM is where it bites hardest, because the extra derivation actively pulls the model away from the parallel belief-tracking the task requires.
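One way to check for this non-monotonic pattern in practice is to sweep the thinking-token budget and see whether accuracy falls after its best point. The sketch below assumes a hypothetical `sweep` mapping from token budget to accuracy; the function names are invented here, and the only data used are the two approximate endpoints reported in the sources list (about 1,100 and about 16K thinking tokens). A real sweep would need more points to show the full inverted-U.

```python
from typing import Dict, Tuple


def best_length(sweep: Dict[int, float]) -> Tuple[int, float]:
    """Return (token_budget, accuracy) for the highest-accuracy entry."""
    if not sweep:
        raise ValueError("sweep is empty")
    budget = max(sweep, key=sweep.get)
    return budget, sweep[budget]


def accuracy_declines_past_peak(sweep: Dict[int, float]) -> bool:
    """True if any budget larger than the best one scores lower than it."""
    peak_budget, peak_acc = best_length(sweep)
    return any(b > peak_budget and acc < peak_acc for b, acc in sweep.items())


# Approximate figures from the sources list: thinking tokens -> accuracy (%).
reported = {1_100: 87.3, 16_000: 70.3}

peak = best_length(reported)                      # the smaller budget wins
declines = accuracy_declines_past_peak(reported)  # more tokens scored lower
```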
A less obvious thread matters here too: a chunk of the corpus questions whether the models were ever 'reasoning' about minds in the first place. Chain-of-thought turns out to be constrained imitation of reasoning's *form* rather than genuine inference: logically invalid CoT exemplars perform nearly as well as valid ones, meaning the model learns the look of reasoning, not the logic (see: Does logical validity actually drive chain-of-thought gains?; Why does chain-of-thought reasoning fail in predictable ways?). On ToM specifically, supervised fine-tuning matches RL, and benchmarks turn out to be solvable through pattern-matching on templated artifacts and distribution biases, so high scores may reflect exploited shortcuts rather than mental-state reasoning at all (see: Can language models solve ToM benchmarks without real reasoning?). If the apparent successes are surface tricks, then 'reasoning effort' has nothing real to amplify.
There's a hopeful counterweight, though. RL on social reasoning *can* produce genuine, transferable belief-tracking, but only above a model-scale threshold; below it, smaller models fake comparable accuracy through shortcuts with no interpretable reasoning trace (see: Does reinforcement learning on theory of mind collapse with model scale?). And RL training can flip extended thinking from counterproductive self-doubt into productive analysis, which says the *quality* of reasoning is trainable, not just its quantity (see: Does extended thinking help or hurt model reasoning?). Read alongside the finding that base models already contain latent reasoning that the right training merely elicits (see: Do base models already contain hidden reasoning ability?), the picture sharpens: theory of mind doesn't fail because models can't reason. It fails because today's reasoning effort optimizes the wrong cognitive shape: sequential derivation where the task needs parallel mind-modeling.
Sources (12 notes)
Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.