INQUIRING LINE

Do longer reasoning traces actually improve theory of mind accuracy?

This explores whether spending more reasoning tokens — longer chains of thought — actually buys better theory-of-mind performance, or whether social reasoning resists the more-thinking-is-better intuition that holds (sometimes) for math and logic.


Do longer reasoning traces help models track what other people believe, want, or falsely assume? The corpus answer is unusually direct: no, and sometimes the opposite. Reasoning-optimized models fail to outperform vanilla LLMs on theory-of-mind tasks, and on some benchmarks do worse than far plainer systems. On the Decrypto tasks for false belief and counterfactual reasoning, Claude 3.7 Sonnet and o1 score worse than humans and worse even than simple word-embedding baselines, suggesting that optimizing a model for formal step-by-step reasoning can actively corrode its social reasoning (Why do reasoning models fail at theory of mind tasks?). A companion finding sharpens the why: reasoning models produce *longer but unhelpful* traces on theory-of-mind tasks and show no generalization to similar scenarios, apparently because social cognition demands holding several candidate mental models in mind at once rather than deriving one answer in sequence (Why do reasoning models struggle with theory of mind tasks?).

That lands inside a broader pattern the collection documents repeatedly: more thinking is not monotonically better. Accuracy follows an inverted U as traces lengthen: it peaks at some intermediate length and then declines, with the optimal length growing with task difficulty but *shrinking* as models get more capable (Why does chain of thought accuracy eventually decline with length?). One striking measurement saw accuracy fall from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K, with models overthinking easy problems (Does more thinking time always improve reasoning accuracy?). So the premise that 'longer = more careful = more accurate' is shaky even before you get to the special difficulty of social reasoning.
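The inverted-U claim can be made concrete with a toy model. The sketch below is an illustration only, not the model used in the cited notes. It assumes each reasoning step closes part of the remaining gap to a solution (faster for more capable models, slower for harder tasks) while also carrying a fixed chance of derailing the chain. The names `toy_accuracy`, `optimal_length`, `capability`, `difficulty`, and `step_error` are all hypothetical.

```python
import math


def toy_accuracy(n_steps, difficulty, capability, step_error=0.02):
    """Toy accuracy for a chain of n_steps (an assumed functional form).

    coverage: chance the chain has done enough work to reach the answer.
    survival: chance that no step so far has derailed the chain.
    """
    coverage = 1.0 - math.exp(-n_steps * capability / difficulty)
    survival = (1.0 - step_error) ** n_steps
    return coverage * survival


def optimal_length(difficulty, capability, max_steps=200):
    """Brute-force the chain length that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability),
    )


# Illustrative parameter values, chosen only to show the qualitative shape.
baseline = optimal_length(difficulty=10, capability=1.0)
harder_task = optimal_length(difficulty=20, capability=1.0)
stronger_model = optimal_length(difficulty=10, capability=2.0)

print(f"baseline optimum:       {baseline} steps")
print(f"harder task optimum:    {harder_task} steps")
print(f"stronger model optimum: {stronger_model} steps")
```

Under these assumptions the optimum moves later for the harder task and earlier for the stronger model, and accuracy falls away on either side of it, which is the qualitative shape the notes describe. The specific numbers mean nothing beyond the toy.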

Here's the thing you might not expect: trace length may not even be measuring reasoning effort. One controlled maze study found that trace length tracks problem difficulty only on familiar in-distribution problems and decouples completely off-distribution; long traces reflect recall of training schemas, not adaptive computation (Does longer reasoning actually mean harder problems?). A sharper claim still: the intermediate tokens carry no special execution semantics. Invalid traces frequently still produce correct answers, so the trace is learned formatting that correlates with the answer rather than a causal mechanism producing it (Do reasoning traces actually cause correct answers?). If a longer trace is stylistic mimicry of reasoning (What makes chain-of-thought reasoning actually work?), lengthening it wouldn't reliably help any task, and would especially flail on theory of mind, where the right move is parallel belief-tracking, not a longer derivation.
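One way to check whether trace length tracks difficulty is to correlate the two separately for in-distribution and out-of-distribution problems. The sketch below shows the shape of that check. The numbers are made-up placeholders, not measurements from the maze study, and the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder data, NOT results from any study.
difficulty = [1, 2, 3, 4, 5, 6]                   # assumed difficulty buckets
tokens_in_dist = [110, 205, 290, 410, 500, 620]   # trace lengths, familiar problems
tokens_out_dist = [400, 380, 420, 390, 410, 400]  # trace lengths, unfamiliar problems

r_in = pearson(difficulty, tokens_in_dist)
r_out = pearson(difficulty, tokens_out_dist)
print(f"in-distribution r = {r_in:.2f}, out-of-distribution r = {r_out:.2f}")
```

A strong correlation on familiar problems alongside a weak one on unfamiliar problems is the decoupling pattern described above. On real data one would also want confidence intervals and more than six points.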

The corpus does leave an important door open: the failure looks more architectural and training-mediated than length-bound. The same mechanism (extended thinking) can flip from harmful to helpful depending on how a model was trained. RL training redirected 'thinking' from counterproductive self-doubt into productive gap analysis, which says quality of reasoning is trainable, not just quantity (Does extended thinking help or hurt model reasoning?). On theory of mind specifically, RL produced explicit, transferable belief-tracking in 7B models, while smaller ones reached comparable accuracy through shortcuts. Crucially, accuracy alone hid that difference; you had to inspect the steps (Does reinforcement learning on theory of mind collapse with model scale?). Approaches that force *explicit* belief tracking, whether hybrid Bayesian architectures or short Bayesian hypothesis-tracing like ThoughtTracing, beat both LLM-alone and longer-trace approaches (Do large language models genuinely simulate mental states?).

So the surprising takeaway: for theory of mind, what helps isn't *more* reasoning but a *different shape* of reasoning — maintaining multiple mental models in parallel rather than chaining one longer and longer. Length is close to the wrong knob.
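To make "maintaining multiple mental models in parallel" concrete, here is a minimal sketch of explicit belief tracking on a Sally-Anne style false-belief scenario. It is not ThoughtTracing's algorithm or any cited architecture, only an illustration of keeping a weighted set of hypotheses about what an agent believes and updating the weights with Bayes' rule. The function names and all probabilities are assumptions chosen for the example.

```python
def propagate(beliefs, new_location, p_witnessed):
    """The object moves; the agent updates only if they saw it happen.

    beliefs maps each hypothesis ("the agent thinks the object is at X")
    to its probability. Every hypothesis is kept alive and reweighted.
    """
    updated = {loc: w * (1.0 - p_witnessed) for loc, w in beliefs.items()}
    updated[new_location] = updated.get(new_location, 0.0) + p_witnessed
    return updated


def reweight(beliefs, likelihood):
    """Bayes' rule: posterior is proportional to prior times the
    likelihood of the observed action under each hypothesis."""
    unnormalized = {h: w * likelihood.get(h, 0.0) for h, w in beliefs.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}


# Sally sees the marble go into the basket, then leaves the room.
sally = {"basket": 1.0}

# Anne moves the marble to the box; assume a 10% chance Sally peeked.
sally = propagate(sally, "box", p_witnessed=0.1)

# Sally returns and walks toward the basket. These are assumed likelihoods
# of that action under each hypothesis about what she believes.
sally = reweight(sally, {"basket": 0.9, "box": 0.2})

prediction = max(sally, key=sally.get)
print(prediction, sally)
```

The design point is that the tracker never commits to one derivation: both "Sally thinks basket" and "Sally thinks box" stay live, and evidence shifts weight between them. A longer sequential chain has no equivalent of that parallel bookkeeping.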


Sources (10 notes)

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether longer reasoning traces genuinely improve theory-of-mind accuracy, or whether that claim has been superseded. The question remains live: what *actually* helps models track false beliefs and counterfactual reasoning?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Oct 2025. Key constraints documented:
• Reasoning-optimized models (Claude 3.7 Sonnet, o1) score *worse* than older baselines on Decrypto false-belief tasks (~2025).
• Accuracy follows an inverted-U as thinking tokens grow; optimal length *shrinks* as models improve; one study saw accuracy drop from 87% to 70% as tokens grew 1.1K→16K (~2025).
• Trace length reflects training-distribution proximity, not adaptive problem difficulty; decouples off-distribution (~2025).
• Invalid reasoning traces frequently still yield correct answers; traces are learned formatting mimicking reasoning, not causal mechanisms (~2025).
• RL training can flip 'thinking' from counterproductive to productive gap analysis; explicit belief-tracking architectures (Bayesian, ThoughtTracing) beat longer-trace LLM approaches (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2502.08796 (Feb 2025) — Systematic review on ToM task evaluation
• arXiv:2502.07266 (Feb 2025) — When More is Less: CoT length analysis
• arXiv:2511.18176 (Oct 2025) — RLVR traces in math domains
• arXiv:2509.07339 (Sep 2025) — Brittle correlation between CoT length and complexity

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether post-Oct 2025 releases (new model architectures, RL approaches, ToM benchmarks, or tooling) have *relaxed* or *overturned* it. Separate the durable question — *what reasoning shape helps theory of mind?* — from perishable limitations like "o1 fails on false belief." Ground any resolution in concrete papers.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months: papers showing longer traces *do* help ToM, or that scaling + new training methods have restored the benefit of extended thinking.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do hybrid Bayesian–LLM ensembles preserve explicit belief-tracking even under longer RL-guided thinking?" or "Has multi-agent reasoning with model pools replaced serial CoT for social inference?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
