Do foundation models learn world models or task-specific shortcuts?
When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
The inductive bias probe paper ("What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models") separates what a foundation model can predict from the inductive bias it has actually acquired. A transformer trained on planetary orbital mechanics can predict trajectories across solar systems it has never seen. But when fine-tuned to predict force vectors, a cornerstone of Newtonian mechanics, it produces nonsensical laws of gravitation, and the implied law changes depending on which slice of the data it is applied to.
The test is precise: a world model (Newtonian mechanics) has a specific inductive bias. If the model has internalized that world model, fine-tuning on a small dataset should leverage it — the model should extrapolate using Newtonian state. The probe reveals it does not. The inductive bias is not toward Newtonian mechanics; it is toward task-specific heuristics that work locally but do not generalize as a unified world model would.
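A toy version of this consistency check can make the logic concrete. The sketch below is not the paper's probe: it assumes an inverse-square force with the constant G*m1*m2 set to 1, and the names `fit_power_law` and `fit_linear_heuristic` are hypothetical. It fits two candidate laws on two separate slices of distance. A learner whose hypothesis space matches the world model recovers the same law on both slices; a local linear shortcut yields a different law on each slice.

```python
import math

def least_squares(xs, ys):
    """Ordinary least squares for y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def newton_force(r):
    """Inverse-square force magnitude, assuming G*m1*m2 = 1 (illustrative units)."""
    return 1.0 / r ** 2

def fit_power_law(rs):
    """Fit F = k * r**p by least squares in log-log space. Returns (k, p)."""
    log_k, p = least_squares([math.log(r) for r in rs],
                             [math.log(newton_force(r)) for r in rs])
    return math.exp(log_k), p

def fit_linear_heuristic(rs):
    """Fit the shortcut F = a + b*r on one slice. Returns (a, b)."""
    return least_squares(rs, [newton_force(r) for r in rs])

# Two slices of the same world: close-range and long-range distances.
near_slice = [1.0 + 0.1 * i for i in range(11)]
far_slice = [4.0 + 0.1 * i for i in range(11)]

if __name__ == "__main__":
    for name, rs in [("near", near_slice), ("far", far_slice)]:
        k, p = fit_power_law(rs)
        a, b = fit_linear_heuristic(rs)
        print(f"{name}: power law k={k:.3f}, p={p:.3f} | linear a={a:.3f}, b={b:.3f}")
```

The power-law fit returns an exponent of about -2 on both slices, while the linear shortcut's slope and intercept change from slice to slice. That slice-dependence is the toy counterpart of "different laws depending on which slice of data it is applied to".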
The pattern holds across domains: Othello game positions, lattice models, orbital mechanics. In each case, models learn to predict legal next states without developing inductive bias toward the underlying state structure. They appear to work on prediction tasks because they recover "coarsened state representations or non-parsimonious representations" — compact shortcuts that are not the world model.
The no-free-lunch theorem grounds this. Every learning algorithm has an inductive bias — the functions it tends to learn when extrapolating from limited data. A world model is a restriction on possible functions; a learning algorithm with that world model should extrapolate within it. Sequence prediction does not impose this restriction. The model finds other functions that fit the training distribution without committing to the world model's structure.
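The point can be shown with two learners that agree on every training example and disagree away from it. This is a minimal sketch with made-up data (three samples of y = x**2) and hypothetical function names; it illustrates inductive bias in general and is not drawn from the paper.

```python
# Three training points sampled from the "world model" y = x**2 (illustrative data).
TRAIN = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]

def nearest_neighbor(x, data=TRAIN):
    """Bias toward local lookup: return the label of the closest training input."""
    return min(data, key=lambda point: abs(point[0] - x))[1]

def quadratic_interpolant(x, data=TRAIN):
    """Bias toward quadratics: evaluate the Lagrange polynomial through the data."""
    total = 0.0
    for i, (xi, yi) in enumerate(data):
        term = yi
        for j, (xj, _) in enumerate(data):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

if __name__ == "__main__":
    # Both learners reproduce the training labels.
    for x, y in TRAIN:
        print(x, y, nearest_neighbor(x), quadratic_interpolant(x))
    # Off the training set they extrapolate differently.
    print(nearest_neighbor(5.0), quadratic_interpolant(5.0))
```

Both functions fit the training data perfectly, so training accuracy cannot tell them apart. Only the extrapolation at x = 5 (4.0 versus 25.0) reveals which restriction on possible functions each learner carries.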
"Reasoning or Reciting?" provides systematic evidence from a different angle. By constructing counterfactual variants of 11 standard tasks — variants that deviate from default assumptions — the paper shows that LLMs exhibit nontrivial performance on counterfactual versions but consistently degrade compared to default conditions. The degradation is not task-specific: it appears across all 11 tasks, suggesting a general reliance on narrow, non-transferable procedures rather than abstract reasoning. This is the behavioral signature of task-specific heuristics: they work on default (training-distribution-aligned) cases but fail when the task is logically equivalent but distributionally shifted.
Circuit-level mechanistic evidence: "Arithmetic Without Algorithms" (2410.21272) provides the most granular evidence yet for the heuristics claim. Using causal analysis to identify the arithmetic circuit in LLMs, the authors discover a sparse set of important neurons that implement simple heuristics — each neuron activates when an operand falls within a certain numerical range and outputs corresponding answers. The unordered combination of these heuristic types explains most of the model's arithmetic accuracy. The model is not running an addition algorithm. It is combining pattern-matching rules — a bag of heuristics that produces correct answers for common cases without any generalizable procedure.
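A toy analogue shows how a bag of range-triggered rules can produce many correct sums without an addition algorithm. Everything below is an assumption made for illustration: the two rules, the voting scheme, and the tie-break are invented here and are not the neurons or heuristic types identified in the paper. Each rule fires on a coarse feature of the operands and votes for a set of candidate answers; votes are summed in no particular order.

```python
from collections import Counter

def range_heuristic(a, b):
    """Fires on the tens-range of each operand; votes for a 20-wide band of answers."""
    low = 10 * (a // 10 + b // 10)
    return range(low, low + 20)

def units_heuristic(a, b):
    """Fires on the operands' last digits; votes for every answer with the matching last digit."""
    last = (a % 10 + b % 10) % 10
    return range(last, 200, 10)

HEURISTICS = [range_heuristic, units_heuristic]

def bag_of_heuristics_add(a, b, heuristics=HEURISTICS):
    """Sum the votes of all rules (order-independent) and return the top candidate.

    Ties are broken toward the smallest candidate; this tie-break is an
    arbitrary assumption of the sketch.
    """
    votes = Counter()
    for rule in heuristics:
        for candidate in rule(a, b):
            votes[candidate] += 1
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)

def accuracy(limit=100):
    """Share of operand pairs in 0..limit-1 that the bag answers correctly."""
    pairs = [(a, b) for a in range(limit) for b in range(limit)]
    return sum(bag_of_heuristics_add(a, b) == a + b for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    print(bag_of_heuristics_add(23, 45))  # no carry in the units digit
    print(bag_of_heuristics_add(27, 45))  # carry in the units digit
    print(accuracy())
```

With these two rules the bag is right exactly when the units digits do not carry and wrong otherwise, because no rule encodes the carry. Correct answers on a large share of inputs therefore coexist with the absence of any general procedure.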
This creates an apparent tension with the note "Can large language models develop genuine world models without direct environmental contact?", which claims that text training does extract world structure. The resolution may lie in the level of analysis: coarse semantic regularities (that note) versus precise generative-mechanistic structure (the probe). Or the tension may be genuine and require empirical resolution.
The familiar vs novel dimension. François Chollet and Subbarao Kambhampati's exchange clarifies the boundary: it's not complexity per se but familiarity at the instance level that determines whether heuristics suffice. LRMs can handle arbitrarily complex tasks as long as they've been covered during training — but show an unfamiliar task, even a simple one requiring just a handful of reasoning steps, and they fail. Scaling up problem variables is a "roundabout way to generate novelty" — the complexity increase forces the model into unfamiliar territory where heuristics break. Kambhampati's rejoinder sharpens this: "we showed that LRMs do indeed lose accuracy as the size of familiar instances grow — they don't learn algorithms." Both agree transformers fit instance-based patterns, not generalizable algorithms. The delineation matters for evaluation: testing on familiar problem types at increasing scale conflates two effects (novel instances vs. algorithmic generalization).
Compositional tasks provide the clearest evidence. "Faith and Fate" (Dziri et al., 2023) shows that on multi-digit multiplication, logic grid puzzles, and dynamic programming problems, transformers solve compositional tasks by reducing multi-step reasoning to linearized subgraph matching. When test problems share computation subgraphs with training data, models succeed; when the composition is novel, they fail. Training yields near-perfect performance at low complexity but "fails drastically" outside the training distribution. Errors in early steps propagate and compound, preventing correct solutions at high complexity. As the note "Do transformers actually learn systematic compositional reasoning?" argues, the heuristic is subgraph matching, and it works well enough within distribution to create the illusion of systematic reasoning. Source: Arxiv/Evaluations.
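The subgraph-matching account can be sketched with a lookup-based multiplier. This is a hypothetical toy, not the analysis in "Faith and Fate": the "model" memorizes each single-step sub-computation (digit, multiplier, incoming carry) seen in training and can only chain steps it has already seen.

```python
def steps(n, m):
    """Yield the per-digit steps of n * m (m a single digit), least significant digit first.

    Each step maps (digit, multiplier, carry_in) to (out_digit, carry_out).
    """
    carry = 0
    for ch in reversed(str(n)):
        total = int(ch) * m + carry
        yield (int(ch), m, carry), (total % 10, total // 10)
        carry = total // 10

def train(problems):
    """Memorize every sub-computation that appears in the training problems."""
    memory = {}
    for n, m in problems:
        for key, value in steps(n, m):
            memory[key] = value
    return memory

def solve(n, m, memory):
    """Chain memorized steps; return None as soon as an unseen step is required."""
    digits, carry = [], 0
    for ch in reversed(str(n)):
        key = (int(ch), m, carry)
        if key not in memory:
            return None
        out, carry = memory[key]
        digits.append(str(out))
    if carry:
        digits.append(str(carry))
    return int("".join(reversed(digits)))

if __name__ == "__main__":
    memory = train([(12, 3), (21, 3), (4, 3)])
    print(solve(1221, 3, memory))  # longer than any training problem, all steps seen
    print(solve(14, 3, memory))    # needs the unseen step (1, 3, carry_in=1)
```

Trained on 12 * 3, 21 * 3, and 4 * 3, the solver handles 1221 * 3, a longer problem built entirely from seen steps, but returns None on 14 * 3: both digits were seen, yet the step with an incoming carry of 1 was not. Success tracks overlap with training subgraphs, not problem size.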
Inquiring lines that use this note as a source 20
This note is a source for these synthesized inquiries.
- Why do foundation models develop heuristics instead of world models?
- What is selective resonance and why do transformers not perform it?
- Can world models form from aggregated partial information across training distributions?
- Do transformers learn generalizable algorithms or instance-based patterns?
- How do training objectives shape what a world model actually learns?
- Can a world model have rich representations without adequate data coverage?
- Why do energy-based models generalize better on out-of-distribution data than standard transformers?
- What inductive bias would force models to learn Newtonian mechanics instead of shortcuts?
- Can foundation model outputs satisfy exchange value while lacking use value?
- Can we decode what individual circuits inside transformers are doing?
- How do foundation models develop task-specific heuristics instead of world models?
- Why do production systems optimize for three model classes instead of foundation models?
- How do transformers generate harder solutions when mostly trained on easier problems?
- Why must world models be nested rather than flat and uniform?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- What distinguishes task-specific heuristics from genuine world models?
- Does sequence prediction accuracy prove an underlying world model exists?
- What data properties enable transformers to learn sequential decision-making in context?
- Do transformer architectures structurally bias models toward short-term optimization?
- What distinctive properties make open foundation models different from closed ones?
Related concepts in this collection 9
- Can large language models develop genuine world models without direct environmental contact?
  Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
  apparent tension: that note claims world structure extraction; this probe finds task-specific heuristics; level of analysis may resolve or genuine conflict
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  parallel structure: encoding doesn't imply use; prediction accuracy doesn't imply world model internalization
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  same pattern in the linguistic domain: correct output without structural learning
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  semantic dependence IS the heuristic mechanism: when commonsense semantics align with the task, heuristics produce correct answers; when they conflict, the model cannot override them
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects (segregation, representation, and composition), each creating distinct failure modes in how networks handle structured information.
  heuristics may be the network's solution to the binding problem: rather than dynamically binding entities into compositional structures (which requires solving segregation, representation, and composition), the model bypasses binding entirely by developing task-specific shortcuts that pattern-match without composing
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  FER is what task-specific heuristics look like at the representation level: fractured solutions that work locally within arbitrary subdomains but lack the unified principles that a genuine world model would provide
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  creates a tension: scaling can produce linearly decodable compositional features, but whether these constitute genuine generalization or scaled heuristics remains open; the heuristics-vs-world-models probe suggests that even compositionally organized representations may lack the inductive bias needed for true world model behavior
- Can language models solve ToM benchmarks without real reasoning?
  Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
  ToM is a specific domain where task-specific heuristics masquerade as genuine capability: SFT matches RL on ToM benchmarks because the benchmarks contain exploitable structural patterns rather than requiring true mental state reasoning
- Do large language models genuinely simulate mental states?
  This explores whether LLMs perform authentic theory of mind reasoning or rely on surface-level pattern matching. The distinction matters because evaluation format (multiple-choice versus open-ended) reveals very different capability levels.
  open-ended ToM evaluation confirms the heuristic pattern: models default to surface strategies that work on structured benchmarks but fail when task scaffolding is removed, precisely as the heuristics-vs-world-models distinction predicts
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
- Faith and Fate: Limits of Transformers on Compositionality
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
- Assessing adaptive world models in machines with novel games
Original note title
foundation models develop task-specific heuristics rather than world models even when sequence prediction accuracy is high