Does reinforcement learning on theory of mind collapse with model scale?

When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

Rule-based RL has proven effective for enhancing structured reasoning in math and coding. The question is whether it generalizes to social reasoning — "interpreting mental states and hidden commonsense" — where rules and ground truths are less well-defined.

The answer is scale-dependent.

7B models: RL induces high-quality, interpretable, and transferable belief-tracking behaviors. The reasoning traces show explicit step-by-step mental state tracking: identifying what each agent knows, what each agent believes about what others know, and how beliefs update as the story progresses. This transfers across benchmarks.
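
The explicit tracking described above can be made concrete with a toy sketch. It is only an illustration of what "who knows what" bookkeeping looks like, not the models' actual procedure or anything from the source. The event format, the `track_beliefs` name, and the assumption that an agent updates a belief only when present for a move are all hypothetical, and the sketch covers first-order beliefs only (not beliefs about others' beliefs).

```python
from collections import defaultdict


def track_beliefs(events):
    """Track first-order beliefs about object locations (toy illustration).

    events: list of tuples in a hypothetical format:
      ("enter", agent), ("exit", agent), ("move", agent, obj, location)
    Assumption: an agent's belief about an object changes only when the
    agent is present while the object is moved. Entering a room does not
    reveal where hidden objects currently are.
    """
    present = set()                 # agents currently in the room
    world = {}                      # true location of each object
    beliefs = defaultdict(dict)     # beliefs[agent][obj] = believed location
    for event in events:
        kind = event[0]
        if kind == "enter":
            present.add(event[1])
        elif kind == "exit":
            present.discard(event[1])
        elif kind == "move":
            _, _mover, obj, location = event
            world[obj] = location
            # Only witnesses update their belief about this object.
            for agent in present:
                beliefs[agent][obj] = location
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return world, dict(beliefs)


# A Sally-Anne style false-belief story.
story = [
    ("enter", "Sally"),
    ("enter", "Anne"),
    ("move", "Sally", "marble", "basket"),
    ("exit", "Sally"),
    ("move", "Anne", "marble", "box"),
    ("enter", "Sally"),
]
world, beliefs = track_beliefs(story)
print("true location:", world["marble"])
print("Sally believes:", beliefs["Sally"]["marble"])
print("Anne believes:", beliefs["Anne"]["marble"])
```

Sally's belief diverges from the true location because she was absent for the second move, which is the kind of intermediate state an interpretable trace would spell out.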

≤3B models: RL leads to reasoning collapse. Despite achieving "substantial accuracy gains comparable to the larger models," these models "failed to generate interpretable, structured reasoning traces." Instead, they produce "drastically shortened, less meaningful responses" — suggesting reliance on "implicit rather than explicit structured reasoning." They appear to have internalized "alternative rules or patterns that are effective for the specific structures found in benchmark datasets."

The mechanism: simple rule-based rewards optimize for correctness, and in models with limited capacity relative to task complexity this "may inadvertently encourage shortcut learning." The model finds a faster path to the right answer that does not involve tracking mental states. The shortcut works on benchmarks but is unlikely to generalize to genuine social interaction.
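
To see why a correctness-based reward cannot distinguish the two behaviors, consider a minimal sketch of an outcome-checking reward. The `<answer>` tag convention and the `outcome_reward` name are assumptions made for illustration; the source does not specify the reward format used.

```python
import re


def outcome_reward(response: str, gold: str) -> float:
    """Return 1.0 if the tagged final answer matches gold, else 0.0.

    Assumption: the final answer is wrapped in <answer>...</answer>.
    Nothing outside the tags influences the reward.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


# Two illustrative responses to "Where will Sally look for the marble?"
explicit_trace = (
    "Sally put the marble in the basket and left. Anne moved it to the box "
    "while Sally was away, so Sally did not see the move. "
    "<answer>basket</answer>"
)
shortcut_trace = "<answer>basket</answer>"

# Both receive the same reward, so the optimizer has no signal that
# favors the explicit belief-tracking response over the bare answer.
print(outcome_reward(explicit_trace, "basket"))
print(outcome_reward(shortcut_trace, "basket"))
```

Under a reward of this shape, a shortened response is never penalized, which is consistent with the shortcut-learning account in the note.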

This creates a "crucial mismatch between achieving high accuracy on benchmark questions and possessing genuine, human-like reasoning capabilities." The mismatch is invisible if you only look at accuracy scores — the 3B model looks comparable to the 7B model. It becomes visible only when you inspect the reasoning traces.
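
One way to surface this mismatch is to report trace-level statistics next to accuracy. The sketch below is a rough heuristic, not the evaluation used in the source: the marker list, the `trace_report` name, and the toy traces are all invented for illustration and are not data from any experiment.

```python
# Hypothetical, hand-picked cues that a trace mentions mental states.
MENTAL_STATE_MARKERS = ("thinks", "believes", "knows", "saw", "did not see", "unaware")


def trace_report(samples):
    """Summarize accuracy alongside crude trace-quality proxies.

    samples: list of (trace_text, is_correct) pairs.
    Returns accuracy, mean trace length in words, and the fraction of
    traces that contain at least one mental-state marker.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    n = len(samples)
    accuracy = sum(1 for _, ok in samples if ok) / n
    mean_words = sum(len(text.split()) for text, _ in samples) / n
    marker_rate = sum(
        1 for text, _ in samples
        if any(marker in text.lower() for marker in MENTAL_STATE_MARKERS)
    ) / n
    return {"accuracy": accuracy, "mean_words": mean_words, "marker_rate": marker_rate}


# Toy traces written by hand for illustration only.
explicit_style = [
    ("Sally did not see the move, so she believes the marble is in the basket.", True),
    ("Anne knows the marble is in the box because she moved it.", True),
]
shortcut_style = [("basket", True), ("box", True)]

print(trace_report(explicit_style))
print(trace_report(shortcut_style))
```

Both toy sets score identical accuracy, while the length and marker statistics separate them. Keyword matching is a weak proxy for reasoning quality, so reading the traces directly remains necessary.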

The finding extends the entropy collapse dynamic from formal reasoning to social reasoning, but with an important twist: in formal domains, shortcut learning tends to reduce diversity while maintaining some reasoning structure. In social reasoning, it eliminates reasoning structure entirely while preserving accuracy — a more severe form of collapse.


Related concepts in this collection (3)


Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
the ToM reasoning collapse is an extreme form of entropy collapse: not just reduced diversity but elimination of interpretable reasoning

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
the scale-dependent finding adds a caveat: RL teaches when to activate only if the model has sufficient capacity; below threshold, RL teaches shortcuts instead

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
the 7B success suggests latent ToM capability exists at scale; the 3B failure suggests it doesn't exist below a capacity threshold for social reasoning


Original note title

rl on ToM produces scale-dependent reasoning collapse — large models develop belief-tracking while small models achieve accuracy through shortcuts
