Does chain-of-thought reasoning actually generalize beyond training data?
Explores whether CoT's strong benchmark performance reflects genuine reasoning ability or only learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning lengths shift away from the training data.
Chain-of-Thought prompting performs well on in-distribution problems and fails predictably as distributional discrepancy increases. This is not a bug; it follows directly from what CoT is.
DataAlchemy experiments train LLMs from scratch in controlled environments and probe them under three distributional shift dimensions:
- Task distribution shift: novel tasks with unique elements or an underlying logical structure not seen during training
- Length distribution shift: reasoning chains substantially longer or shorter than the length range of the training data
- Format distribution shift: prompt formulation variations (even minor syntactic changes) that fall outside the training distribution
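The three shift dimensions can be made concrete with a toy data generator. The sketch below is an illustration only and is not DataAlchemy's actual code: the letter-shift and rotation operations, the prompt templates, and every name in it (`rot`, `cycle`, `make_example`, `build_splits`) are assumptions chosen for this example.

```python
import random
import string

ALPHABET = string.ascii_uppercase


def rot(text: str, k: int) -> str:
    """Shift every letter k places forward in the alphabet (hypothetical task element)."""
    return "".join(ALPHABET[(ALPHABET.index(c) + k) % 26] for c in text)


def cycle(text: str, k: int) -> str:
    """Rotate the string left by k positions (a second hypothetical task element)."""
    k %= len(text)
    return text[k:] + text[:k]


OPS = {"rot": rot, "cycle": cycle}


def apply_chain(text, chain):
    """Apply a list of (op_name, k) steps in order; each step is one reasoning hop."""
    for name, k in chain:
        text = OPS[name](text, k)
    return text


def make_example(rng, chain, template="{ops} : {x} ->", length=4):
    """Build one prompt/answer pair for a random input string."""
    x = "".join(rng.choice(ALPHABET) for _ in range(length))
    ops = " ".join(f"[{name}{k}]" for name, k in chain)
    return {"prompt": template.format(ops=ops, x=x), "answer": apply_chain(x, chain)}


def build_splits(seed=0, n=5):
    """Return one in-distribution split and one split per shift dimension."""
    rng = random.Random(seed)
    train_chain = [("rot", 1), ("rot", 2)]      # composition assumed seen in training
    novel_chain = [("rot", 1), ("cycle", 1)]    # task shift: unseen composition
    long_chain = train_chain * 2                # length shift: twice as many steps
    return {
        "in_dist": [make_example(rng, train_chain) for _ in range(n)],
        "task_shift": [make_example(rng, novel_chain) for _ in range(n)],
        "length_shift": [make_example(rng, long_chain) for _ in range(n)],
        # format shift: same task, reordered prompt with different delimiters
        "format_shift": [
            make_example(rng, train_chain, template="{x} | {ops} =>") for _ in range(n)
        ],
    }
```

Each split changes exactly one dimension relative to `in_dist`, which is what makes it possible to attribute a failure to a specific kind of shift.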
In all three dimensions, the pattern is the same: CoT works within distribution, fails outside it. Under moderate shifts, models generate fluent yet logically inconsistent reasoning — the form holds, the logic breaks. This is the "mirage" phenomenon: outputs look like reasoning while producing wrong conclusions.
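One way to separate "the form holds" from "the logic breaks" is to audit a reasoning chain step by step. The sketch below assumes a toy setting in which every step is a known string transformation; `audit_chain`, `rot1`, and the well-formedness rule are hypothetical names and choices for this illustration, not part of the DataAlchemy study.

```python
from typing import Callable, List

Step = Callable[[str], str]


def rot1(text: str) -> str:
    """Shift each uppercase letter one place forward (hypothetical toy operation)."""
    return "".join(chr((ord(c) - 65 + 1) % 26 + 65) for c in text)


def audit_chain(start: str, ops: List[Step], claimed: List[str]) -> dict:
    """Compare a model's claimed intermediate strings against the true operations."""
    # Form: the right number of steps, each looking like a plausible intermediate string.
    well_formed = len(claimed) == len(ops) and all(
        len(s) == len(start) and s.isalpha() and s.isupper() for s in claimed
    )
    # Logic: every claimed step must follow from the step before it.
    first_bad = None
    current = start
    for i, (op, got) in enumerate(zip(ops, claimed)):
        if op(current) != got:
            first_bad = i
            break
        current = got
    logically_valid = well_formed and first_bad is None
    return {
        "well_formed": well_formed,
        "logically_valid": logically_valid,
        "first_bad_step": first_bad,
        # Fluent in form but wrong in logic: the "mirage" case described above.
        "mirage": well_formed and not logically_valid,
    }


result = audit_chain("ABCD", [rot1, rot1], ["BCDE", "CDEG"])
```

In this example the second claimed step looks like a valid intermediate string but does not follow from the first, so the chain is flagged as well formed yet logically invalid.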
The interpretive frame: CoT reflects a structured inductive bias learned from training data, not a generalizable reasoning capability. When a test query is within this inductive bias, CoT activates the appropriate reasoning schema and produces good outputs. When the query falls outside it, the schema mismatch produces confident-sounding nonsense.
The practical implication: CoT is not a plug-and-play solution. Performance on CoT benchmarks measures in-distribution capability. Extrapolating to novel tasks, unusual prompt formulations, or unusually long or short reasoning chains is unjustified, because benchmark scores do not predict performance under distribution shift.
This provides the empirical grounding for "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": the mirage emerges from imitation under distribution shift, where the model keeps imitating the form of reasoning while having no schema to produce valid content.
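A minimal sketch of how this could be checked in practice: report accuracy per split rather than as a single benchmark number. Everything here is assumed for illustration, including the split names, the prompt format, and `lookup_model`, a stand-in that has simply memorized its training prompts and is not a real LLM.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples whose predicted answer matches exactly."""
    return sum(model(ex["prompt"]) == ex["answer"] for ex in examples) / len(examples)


def shift_report(model, splits: Dict[str, List[Example]], baseline: str = "in_dist"):
    """Accuracy per split plus the drop relative to the in-distribution split."""
    scores = {name: accuracy(model, exs) for name, exs in splits.items()}
    return {
        name: {"accuracy": acc, "drop": scores[baseline] - acc}
        for name, acc in scores.items()
    }


# Hypothetical stand-in for a trained model: a table of memorized training prompts.
memorized = {"[rot1] AB ->": "BC", "[rot1] XY ->": "YZ"}


def lookup_model(prompt: str) -> str:
    return memorized.get(prompt, "")


splits = {
    "in_dist": [
        {"prompt": "[rot1] AB ->", "answer": "BC"},
        {"prompt": "[rot1] XY ->", "answer": "YZ"},
    ],
    "format_shift": [{"prompt": "AB | [rot1] =>", "answer": "BC"}],
    "length_shift": [{"prompt": "[rot1] [rot1] AB ->", "answer": "CD"}],
}

report = shift_report(lookup_model, splits)
```

The stand-in is deliberately extreme, but it shows the point: a perfect in-distribution score carries no information about the shifted splits unless they are measured separately.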
Inquiring lines that use this note as a source (233)
- What makes conceptual inquiry the fastest high-scoring AI interaction pattern?
- Why do naive baselines outperform trained models in entity-level CRS evaluation?
- What other hidden biases might aggregate metrics fail to distinguish from reasoning?
- How should we redesign benchmarks to catch conservative bias in reasoning tasks?
- Can surface heuristics override implicit constraints in domain-specific reasoning?
- Does chain-of-thought text causally drive reasoning or merely reflect it?
- Can steering a single latent feature replicate chain-of-thought performance?
- What detection methods can catch each distinct CoT bypass strategy?
- How do transformers perform multi-hop reasoning across distant training documents?
- Why does scaling reasoning tokens fail to improve unfamiliar tasks?
- Does sentence-level granularity capture enough structure for complex reasoning tasks?
- How does SONAR embedding quality affect downstream reasoning accuracy?
- Does the Heuristic Override Benchmark measure enumeration or world knowledge?
- What makes a background condition relevant to a specific reasoning task?
- Can step-level deliberation flags guide other reasoning systems?
- Can graph cyclicity and topology predict when reasoning systems achieve breakthrough insights?
- How can per-step decisions about knowledge retrieval improve reasoning over uniform policies?
- Does iterative denoising order affect the reasoning style diffusion models learn?
- Does changing decoding procedure reveal hidden chain-of-thought paths?
- What makes reasoning capability a pre-training rather than post-training phenomenon?
- What formal representation could capture analogical reasoning across domains?
- Why do contrastive reasoning approaches outperform single-path belief evaluation?
- How does cognitive fit theory explain why different tasks need different knowledge structures?
- How much of the combinatorial task space must training data cover?
- How much do mechanistic interpretability findings reflect true reasoning architecture?
- How does inference compute substitution affect the training parameter scaling trade-off?
- How do humans and LMs differ on multi-hop reasoning?
- Can reasoning benchmarks separate logic from believability?
- Do explicit reasoning chains improve or harm performance on complex judgment tasks?
- Can activation patching reveal which reasoning steps actually matter?
- How much does pre-training frequency predict reasoning task performance?
- Can extended thinking genuinely improve reasoning or just increase variance?
- Why does chain-of-thought fail when problems lack matching training schemata?
- Is chain-of-thought reasoning actual computation or distribution imitation?
- How much does training data format shape what reasoning strategy emerges?
- What happens to chain-of-thought performance across distribution shifts?
- Why does training format shape reasoning strategy more than domain?
- How much does pretraining contribute to ToM performance versus task-specific training?
- How do surface correlations between narratives and answers mislead benchmark validity?
- What distribution patterns appear across different theory-of-mind datasets?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- Why does training data format shape reasoning strategy more than domain content?
- Can the three-stage DoT framework detect all cognitive distortion types reliably?
- How should domain-specific AI be evaluated differently from general benchmarks?
- Does domain training degrade reasoning ability even when benchmark scores rise?
- Why does fine-tuning sometimes damage chain-of-thought reasoning even when accuracy improves?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- Does compositional generalization emerge suddenly or improve smoothly with scale?
- How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- Can reasoning skills trained on law improve performance in STEM?
- How does cross-domain reasoning transfer differ from domain-specific knowledge transfer?
- Can adaptive compute distribution across prompts replace the need for sophisticated reasoning frameworks?
- Why do models fail on logically equivalent tasks with different data distributions?
- Why do task-specific heuristics fail at generalizing to sparse data regions?
- Does more inference compute help reasoning models match specialized domain performance?
- When does explicit reasoning actually degrade performance on a task?
- Why does comparison reasoning generalize better than composition reasoning?
- Does irrelevant context degrade reasoning even within model context limits?
- Which RAG sub-decisions are actually pattern matching versus reasoning intensive?
- Can latent reasoning in continuous space scale beyond supervised reasoning tasks?
- Can hyperedges replace triple-based externalization in reasoning tasks?
- Does scaling data automatically produce compositional reasoning or just better feature encoding?
- What mechanisms drive test-time compute allocation in reasoning tasks?
- How do chain-of-thought structures affect reasoning robustness?
- Can extended reasoning training capture individual strategic thinking styles?
- What makes counterfactual thinking different from behavioral pattern matching?
- Can small models solve complex tasks using externalized reasoning graphs?
- What saliency patterns distinguish successful from failed chain-of-thought reasoning?
- Can correct outputs mask reliance on surface heuristics rather than deep understanding?
- Does training data format shape model reasoning more than domain content?
- Does model scaling improve knowledge storage faster than reasoning ability?
- How does chain-of-thought training change higher layer computations?
- Do task-specific heuristics emerge because they compress well enough?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
- How does training data distribution create asymmetric competence across relation types?
- How can entailment benchmarks separate genuine reasoning from memorization effects?
- What makes knowledge-rich specialized domains structurally different from general reasoning tasks?
- Does reasoning structure match explicit versus implicit task demands?
- Can chain of thought reasoning actually validate logical arguments?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- Does chain-of-thought reasoning specifically improve performance on metalinguistic tasks?
- Can frozen world models from training cutoff remain adequate for real-world reasoning?
- When is GPT model interpretation most likely to diverge from user intent?
- Does chain-of-thought reasoning improve mental state tracking in dialogue?
- Which structural properties of CoT prompts matter most for performance?
- How does training format shape reasoning strategy more than content?
- Can we transfer reasoning structure without copying surface form?
- Why does distillation transfer reasoning patterns with few examples?
- What makes certain bond distributions more learnable than others?
- How does meta-reasoning combine information distributed across multiple chains?
- How does MCTS combine parallel exploration with sequential reasoning depth?
- Do chain-of-thought explanations reveal genuine reasoning or trigger latent features?
- Does policy entropy collapse limit how many iterations of reasoning training work?
- Why are pairwise relations insufficient for representing higher-order multi-hop reasoning?
- What makes reasoning-specific post-training different from standard parameter scaling?
- How much does training composition affect syntactic versus reasoning performance?
- Why does extended thinking increase output variance without improving reasoning quality?
- How does training data format shape whether models reason in parallel or sequentially?
- Does deep-thinking ratio measure computational effort better than chain-of-thought length?
- How much does training data presentation format shape reasoning ability?
- Do reasoning systems reuse cognitive structures across unrelated topics?
- Can recursive subtask trees implement tree-of-thought reasoning more efficiently?
- How do game-based benchmarks reveal reasoning fragmentation across domains?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- What explains the gap between perplexity performance and actual reasoning capability?
- Can scaffolding frameworks isolate inductive reasoning from deductive confounds?
- How does graph of thoughts enable divide-and-conquer reasoning patterns?
- What makes multi-paradigm chaining a distinct reasoning topology?
- Can knowledge graphs externalize and validate reasoning steps during inference?
- How does post-training on traces improve performance without semantic reasoning?
- Does scaling reasoning capability create tradeoffs with instruction following?
- Is the reasoning cliff actually a tool-use problem?
- Can deliberate corruption of reasoning traces harm out of distribution generalization?
- Does small-world structure in reasoning graphs improve generalization?
- Can dataset design systematically expand reasoning graph diameter?
- How does scaling reasoning capability actually reduce instruction-following ability?
- Why do we measure reasoning quality by reading visible chains?
- How much does test-time compute improve reasoning without more tokens?
- How do retrieval heads enable chain-of-thought reasoning to reference earlier context?
- Why does outcome supervision fail for long reasoning chains?
- How does reinforcement learning differ from chain-of-thought distillation?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- How does RL compress reasoning path diversity during training?
- Why do long-horizon reasoning tasks need per-turn step limits rather than just compute budgets?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- How does a single training example trigger phase transitions in reasoning output?
- Why does long CoT training optimize for structural coherence over content correctness?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- Why do SFT models memorize patterns instead of learning generalizable reasoning?
- How does chain-of-thought reasoning become decorative after domain-specific fine-tuning?
- Does SFT degrade reasoning quality while improving domain accuracy?
- Can theory of mind models generalize across structurally similar scenarios?
- Can continuous latent reasoning match discrete chain-of-thought without training modifications?
- Can reasoning evaluation metrics reward actual reasoning instead of theater?
- Why do current speech benchmarks fail to measure reasoning over audio?
- Does this reasoning steering method work consistently across all model sizes?
- Can knowledge graph structure alone generate sufficient training signals for domain reasoning?
- How do random walk reasoning chains from knowledge graphs compare to traditional fine-tuning?
- Why do instruction following and reasoning capability trade off in training?
- Why does augmenting symbolic reasoning outperform replacing it entirely?
- What metric distinguishes deep reasoning from superficial information propagation?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- Does training on self-play disagreement data improve multi-agent reasoning outcomes?
- Does inference-time compute improve pretraining data efficiency in practice?
- What distinguishes real understanding from superficial pattern matching?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- How much reasoning depth do we actually need for most real-world tasks?
- How does training data format shape which reasoning patterns emerge in models?
- How can one training example improve reasoning across thousands of unseen problems?
- Why do AI benchmarks measure accuracy instead of reasoning quality?
- Does penalizing thought transitions improve reasoning without model retraining?
- How can high benchmark performance mask broken reasoning in AI systems?
- Can we improve reasoning by amplifying information at mutual information peaks?
- What makes thought identifiability provable without auxiliary training data?
- Can memorization scores diagnose where reasoning chains become unreliable?
- Can attribute decomposition improve other interactive reasoning tasks beyond clinical questioning?
- Why does training data format shape reasoning strategy more than content?
- Why does reasoning training improve math but hurt knowledge tasks?
- Can minimal reasoning steps match verbose reasoning accuracy?
- Why do cross-product features fail to generalize across unseen feature combinations?
- What distinguishes task-specific heuristics from genuine world models?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- Can simple structure perturbations reliably expose memorization in reasoning models?
- How do task stream groupings provide long-horizon learning signals for curation decisions?
- Can activation steering vectors compress reasoning without retraining models?
- Can training format itself shape what reasoning strategy a model learns?
- Can verification loops and decomposition fix judgment failures?
- Why does chain-of-thought fail to improve multimodal model perception performance?
- Does belief-shift credit assignment generalize to tasks without ground-truth outcomes?
- What makes a causal abstraction more transferable than a generic heuristic?
- Why does semantic similarity retrieval enable skill transfer to novel situations?
- Why do reasoning tasks improve more than retrieval from lookup memory?
- Does trace length actually reflect problem difficulty or training proximity?
- Do longer chain-of-thought traces improve interpretability or just performance?
- Does latent reasoning capability exist in base models before any training?
- Does training data format shape reasoning strategy more than domain content?
- Can benchmark improvements hide degradation of deliberative reasoning?
- What distinguishes data that generalizes broadly from task-specific memorization?
- What distinguishes graph-of-thought reasoning from other structured reasoning topologies?
- Can we predict out-of-distribution generalization without access to downstream tasks?
- Do synthetic verification chains from long-CoT models match the quality of human-annotated process labels?
- Why does second-hop reasoning fail when composed with out-of-distribution triples?
- Does next-token prediction actually explain how human thought works?
- How do timing and search internalization interact during reasoning post-training?
- Can we predict which tasks will decompose into modular subnetworks?
- Can mathematical reasoning improvements transfer across problem subdomains?
- Why does reasoning transfer across different numbers but factual recall does not?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- Why might rationales that predict common text patterns fail on hard novel reasoning?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- Does chain-of-thought accuracy degrade with longer reasoning traces?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- Do reasoning benchmarks predict real performance in long delegated workflows?
- Does token-level reasoning during pretraining improve general reasoning without task-specific supervision?
- How does RPT compare to learning when versus how to deploy reasoning?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- How much do compressed reasoning traces transfer across different models?
- Does reasoning style transfer matter more than solution correctness in distillation?
- Can distillation from stronger models create genuinely new reasoning abilities?
- What does pass@k reveal about base model reasoning capacity?
- Does policy entropy collapse in formal reasoning produce the same outcome in social reasoning?
- How can benchmark accuracy scores mask the absence of interpretable reasoning structure?
- Why do shorter confident reasoning traces fail on out-of-distribution problems?
- Does sparsity-guided ordering work equally well for reasoning and classification tasks?
- How do frontier models maintain agreement scores above 90 percent across reasoning tasks?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Does CoT reasoning actually cause the outputs that follow it?
- Can post-hoc analysis of reasoning traces actively mislead users?
- Why do corrupted reasoning traces sometimes generalize better than correct ones?
- What happens when models optimize specifically against CoT monitors?
- Does the token prediction framing actually capture what human reasoning does?
- What computational structures can actually scale serial reasoning depth?
- How much does training data format influence reasoning strategy versus domain content?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- How does training data structure shape reasoning strategy more than domain content?
- Why does single-shot learning fail in REVTHINK's multi-source reasoning tasks?
- Can base models spontaneously produce reasoning traces without any RL training?
- Is reasoning failure caused by task complexity or training distribution gaps?
- How do extrapolative and contextual generalization measure RL reasoning gains?
- How does supervised fine-tuning degrade chain-of-thought faithfulness over time?
- Why do non-experts default to familiar chart types despite domain complexity?
- Can articulating latent reasoning processes improve transfer across domains?
- How does contrapositive augmentation change the tractability of reasoning tasks?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- Can scaling data alone solve performance gaps on long-tail concepts?
- What makes procedural knowledge in documents generalize better than facts?
- Why does strategy diversity within reasoning chains improve model generalization?
- Can expert-derived knowledge bases scale to other high-stakes domains?
- Can single representation edits match chain-of-thought reasoning without explicit steps?
- Can small demonstration sets unlock general reasoning without large question data?
- How does structured environment state compare to transcript replay for multi-turn reasoning?
Related concepts in this collection (4)
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  DataAlchemy provides the empirical confirmation: imitation fails under distribution shift because no schema matches
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  distribution-bounded CoT is neither sufficient (fails under shift) nor necessary (in-distribution performance may not require the chain)
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  same pattern: surface patterns work in-distribution, fail under structural change
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  format-dependency is part of distribution-boundedness: changing the format is a distribution shift
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Hierarchical Reasoning Model
- Break the Chain: Large Language Models Can be Shortcut Reasoners
- CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective
- Measuring Faithfulness in Chain-of-Thought Reasoning
- Chain of Thoughtlessness? An Analysis of CoT in Planning
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
Original note title
cot reasoning is distribution-bounded — effectiveness degrades predictably with distributional discrepancy