Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong benchmark performance reflects genuine reasoning ability or only learned patterns tied to specific training distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from the training data.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

Chain-of-Thought prompting performs well on in-distribution problems and fails predictably as the discrepancy between test and training distributions grows. This is not a bug: it follows from what CoT is.

DataAlchemy experiments train LLMs from scratch in controlled environments and probe them under three distributional shift dimensions:

Task distribution shift: novel tasks whose elements or underlying logical structure were not seen during training
Length distribution shift: reasoning chains substantially longer or shorter than the length range seen in training
Format distribution shift: variations in prompt formulation (even minor syntactic changes) that fall outside the training distribution
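
To make the three dimensions concrete, here is a minimal sketch on a hypothetical toy task: applying a chain of letter operations to a short word. It is not the DataAlchemy setup, whose details this note does not give. The operation names (`rot1`, `rot2`, `rev`), the prompt layout, and the helpers `make_example` and `make_splits` are all assumptions made for illustration.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def rot(word, k):
    """Shift every letter k places forward in the alphabet, wrapping around."""
    return "".join(ALPHABET[(ALPHABET.index(c) + k) % 26] for c in word)


# Hypothetical operation set; "rev" is deliberately withheld from training.
OPS = {
    "rot1": lambda w: rot(w, 1),
    "rot2": lambda w: rot(w, 2),
    "rev": lambda w: w[::-1],
}

TRAIN_OPS = ["rot1", "rot2"]  # operations assumed to appear in training
TRAIN_LENGTHS = [2, 3]        # chain lengths assumed to appear in training


def make_example(rng, ops, length, sep=" ", force_first=None):
    """Build one (prompt, answer) pair for a chain of `length` operations."""
    word = "".join(rng.choice(ALPHABET) for _ in range(4))
    chain = [rng.choice(ops) for _ in range(length)]
    if force_first is not None:
        chain[0] = force_first  # guarantee the unseen operation is present
    answer = word
    for name in chain:
        answer = OPS[name](answer)
    return sep.join(chain) + " : " + word, answer


def make_splits(seed=0, n=3):
    """Return one in-distribution split and one split per shift dimension."""
    rng = random.Random(seed)

    def train_len():
        return rng.choice(TRAIN_LENGTHS)

    return {
        "in_distribution": [make_example(rng, TRAIN_OPS, train_len()) for _ in range(n)],
        # Task shift: the chain contains an operation never seen in training.
        "task_shift": [make_example(rng, TRAIN_OPS, train_len(), force_first="rev") for _ in range(n)],
        # Length shift: chains longer than any training chain.
        "length_shift": [make_example(rng, TRAIN_OPS, 6) for _ in range(n)],
        # Format shift: same task and length, different separator in the prompt.
        "format_shift": [make_example(rng, TRAIN_OPS, train_len(), sep=" -> ") for _ in range(n)],
    }


if __name__ == "__main__":
    for name, examples in make_splits().items():
        print(name, examples[0])
```

Each shifted split changes exactly one property relative to the in-distribution split, which is what lets a failure be attributed to a single dimension.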

In all three dimensions, the pattern is the same: CoT works within distribution, fails outside it. Under moderate shifts, models generate fluent yet logically inconsistent reasoning — the form holds, the logic breaks. This is the "mirage" phenomenon: outputs look like reasoning while producing wrong conclusions.

The interpretive frame: CoT reflects a structured inductive bias learned from training data, not a generalizable reasoning capability. When a test query resembles the data that shaped this bias, CoT activates the appropriate reasoning schema and produces good outputs. When the query falls outside it, the schema mismatch produces confident-sounding nonsense.

The practical implication: CoT is not a plug-and-play solution. Performance on CoT benchmarks measures in-distribution capability. Extrapolating from it to novel tasks, unusual prompt formulations, or unusually long or short reasoning chains is unjustified, because benchmark scores do not predict performance under distribution shift.
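
One way to act on this is to report accuracy per split instead of a single benchmark number. The sketch below is a minimal illustration under stated assumptions: `accuracy_by_split` is a hypothetical helper, the prompts use a made-up toy task in which `rot1` and `rot2` shift each letter one or two places forward, and `brittle_model` is a deliberately extreme stand-in (a lookup table), not a real LLM. The numbers it yields are not results from the experiments described here.

```python
def accuracy_by_split(model, splits):
    """Return {split_name: exact-match accuracy} for a callable model."""
    report = {}
    for name, examples in splits.items():
        correct = sum(1 for prompt, answer in examples if model(prompt) == answer)
        report[name] = correct / len(examples)
    return report


def brittle_model(prompt):
    """Stand-in 'model' (assumption): answers only prompts it has memorized."""
    memorized = {
        "rot1 rot1 : AB": "CD",
        "rot1 rot2 : AB": "DE",
    }
    return memorized.get(prompt, "??")


# Hand-written toy examples; the shifted prompts vary one property each.
splits = {
    "in_distribution": [("rot1 rot1 : AB", "CD"), ("rot1 rot2 : AB", "DE")],
    "format_shift": [("rot1 -> rot1 : AB", "CD")],
    "length_shift": [("rot1 rot1 rot1 rot1 : AB", "EF")],
}

report = accuracy_by_split(brittle_model, splits)

if __name__ == "__main__":
    for name, acc in report.items():
        print(f"{name}: {acc:.2f}")
```

A perfect in-distribution score here coexists with zero accuracy on both shifted splits, so the headline number alone would say nothing about behavior outside the training distribution.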

This provides the empirical grounding for "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": the mirage emerges from imitation under distribution shift, where the model keeps imitating the form of reasoning while having no schema to produce valid content.

Related concepts in this collection 4

Does chain-of-thought reasoning reveal genuine inference or pattern matching?
Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
DataAlchemy provides the empirical confirmation: imitation fails under distribution shift because no schema matches.

Do language models actually use their reasoning steps?
Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
Distribution-bounded CoT is neither sufficient (fails under shift) nor necessary (in-distribution performance may not require the chain).

Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
Same pattern: surface patterns work in-distribution, fail under structural change.

Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
Format-dependency is part of distribution-boundedness: changing the format is a distribution shift.
