Does logical validity actually drive chain-of-thought gains?
What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
"Invalid Logic, Equivalent Gains" runs a clean experiment: replace the valid reasoning in CoT exemplar prompts with completely illogical reasoning, then measure performance on BIG-Bench Hard tasks. The result: logically invalid CoT prompts perform close behind valid CoT prompts and outperform answer-only prompting. The logical validity of the exemplar reasoning is therefore not the chief driver of the performance gain.
This is a sharp test because it isolates the contribution of logical validity from everything else CoT provides: output format, step decomposition, intermediate token generation, attention pattern scaffolding. If invalid reasoning still helps nearly as much, then most of the benefit comes from these structural properties, not from the logical soundness of the reasoning.
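The three-way comparison can be sketched as below. This is a minimal illustration, not the paper's code: the `Exemplar` class, the `build_prompt`, `extract_answer`, and `accuracy` helpers, the "Q:/A:" layout, and the "So the answer is" scaffold are all assumptions made for the example, and `model` stands for any hypothetical callable that maps a prompt string to a completion string.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Exemplar:
    question: str
    valid_reasoning: str    # logically sound step-by-step rationale
    invalid_reasoning: str  # same step format, but the logic is nonsense
    answer: str


def build_prompt(exemplars: Sequence[Exemplar], condition: str, test_question: str) -> str:
    """Build a few-shot prompt for one condition: answer_only, valid_cot, or invalid_cot."""
    blocks = []
    for ex in exemplars:
        if condition == "answer_only":
            body = f"A: The answer is {ex.answer}."
        elif condition == "valid_cot":
            body = f"A: {ex.valid_reasoning} So the answer is {ex.answer}."
        elif condition == "invalid_cot":
            body = f"A: {ex.invalid_reasoning} So the answer is {ex.answer}."
        else:
            raise ValueError(f"unknown condition: {condition}")
        blocks.append(f"Q: {ex.question}\n{body}")
    # The test question is left open for the model to complete.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion: str) -> str:
    """Take the text after the last 'the answer is' marker (assumed scaffold)."""
    marker = "the answer is"
    idx = completion.lower().rfind(marker)
    if idx == -1:
        return completion.strip()
    return completion[idx + len(marker):].strip().rstrip(".").strip()


def accuracy(
    model: Callable[[str], str],
    exemplars: Sequence[Exemplar],
    condition: str,
    test_set: Sequence[Tuple[str, str]],
) -> float:
    """Exact-match accuracy of `model` on (question, gold answer) pairs."""
    correct = 0
    for question, gold in test_set:
        completion = model(build_prompt(exemplars, condition, question))
        correct += extract_answer(completion) == gold
    return correct / len(test_set)
```

Running `accuracy` once per condition with the same exemplar questions and test set keeps everything fixed except the rationale text, which is the isolation the experiment relies on.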
The finding directly supports the thesis of "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": if the model were learning to reason from exemplars, invalid exemplars would degrade performance substantially. Instead, the model is learning the FORM of step-by-step output; the structure activates latent capabilities without the exemplar content needing to be logically sound.
This also deepens the concern raised in "Do language models actually use their reasoning steps?": if the exemplar reasoning doesn't need to be valid for CoT to work, then the model's own generated reasoning may similarly be decorative rather than causal. The exemplar finding makes the faithfulness concern bidirectional: neither the input reasoning (exemplars) nor the output reasoning (generated CoT) needs to be logically valid for the performance gain to occur.
The practical implication: CoT prompt engineering should focus on structural properties (step count, decomposition format, answer scaffolding) rather than on the logical correctness of the exemplar reasoning. This is consistent with "Why do chain-of-thought examples fail across different conditions?", where the dimensions that matter are structural (complexity, order, style), not logical.
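One way to act on this is to profile exemplars by structure rather than by logic. The sketch below is only an illustration under stated assumptions: `structural_profile` is a hypothetical helper, sentences are treated as a rough proxy for reasoning steps, and "the answer is" is an assumed answer scaffold, not a format taken from the paper.

```python
import re


def structural_profile(reasoning: str, answer_marker: str = "the answer is") -> dict:
    """Summarize structural properties of one exemplar rationale."""
    # Sentence boundaries stand in for step boundaries (a crude assumption).
    steps = [s for s in re.split(r"(?<=[.!?])\s+", reasoning.strip()) if s]
    word_counts = [len(s.split()) for s in steps]
    return {
        "step_count": len(steps),
        "mean_step_words": sum(word_counts) / len(steps) if steps else 0.0,
        "has_answer_scaffold": answer_marker in reasoning.lower(),
    }
```

Comparing the profiles of a valid exemplar and its invalid counterpart gives a quick check that the two differ in logic only, not in step count or answer scaffolding.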
Inquiring lines that use this note as a source (190)
- What makes counterfeiting social warrant different from counterfeiting factual claims?
- How does validation skill replace production skill in AI systems?
- What makes accountability and validity-orientation non-behavioral properties?
- What makes emotional alignment more effective than logic when reasoning errors are exposed?
- What makes colorless green ideas fail where Jabberwocky succeeds?
- What makes quasi-beliefs real enough to explain AI behavior?
- Does good simulation eventually count as genuine realization?
- Why does item discrimination matter more than surface-level question plausibility?
- Can proxy evaluation of ideas accurately predict their quality without implementation?
- What makes Beck's diagram effective for constraining simulated patient behavior?
- How much does faithfulness vary naturally in reasoning without evaluation pressure?
- Do recency-focused prompts and in-context examples work equally well for order recovery?
- What detection methods can catch each distinct CoT bypass strategy?
- What distinguishes planning knowledge from an executable plan that works?
- Does irrelevant content degrade reasoning even when it fits the context window?
- What makes training-free approaches like Soft Thinking preferable to SoftCoT?
- What structural features force users to evaluate the epistemic status of outputs?
- What would whole-system AGI evaluation look like in practice?
- Why do contrastive reasoning approaches outperform single-path belief evaluation?
- What makes schema identification necessary after assessing thoughts and evidence?
- How does cognitive fit theory explain why different tasks need different knowledge structures?
- What structural evidence shows that polished presentation substitutes for actual thinking in AI output?
- How much do mechanistic interpretability findings reflect true reasoning architecture?
- Can high-entropy tokens and step-level confidence identify the same critical reasoning forks?
- Why do benchmark designers treat content effects as confounds?
- What mechanism causes confident false answers under high cognitive load?
- Can reasoning benchmarks separate logic from believability?
- Do explicit reasoning chains improve or harm performance on complex judgment tasks?
- Why do top performers produce shorter chains of thought in their strongest domains?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- Can activation patching reveal which reasoning steps actually matter?
- Why do logically invalid chain-of-thought examples work nearly as well?
- Does each reasoning step in chain-of-thought introduce cumulative error?
- How much does training data format shape what reasoning strategy emerges?
- What happens to chain-of-thought performance across distribution shifts?
- Can reasoning chains work without logical validity?
- How do surface correlations between narratives and answers mislead benchmark validity?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- Can contextual design decisions resist formalization into evaluation rubrics?
- Why might expressed satisfaction with explanations diverge from actual cognitive clarity?
- What makes Compound-QA expose weaknesses in monologue reasoning?
- Can the three-stage DoT framework detect all cognitive distortion types reliably?
- Does explicit reasoning help or hurt tasks requiring continuous nuanced judgment?
- Can reasoning skills trained on law improve performance in STEM?
- What makes a claim socially valid even if factually imprecise?
- Why do user studies of explanations fail to predict deployed effectiveness?
- What makes symbolic operations different from general knowledge questions?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- Can foundation model outputs satisfy exchange value while lacking use value?
- How does cognitive load explain linguistic patterns in both deception and incorrect reasoning?
- What distinguishes genuine understanding from correct output without coherent principles?
- When does explicit reasoning actually degrade performance on a task?
- Can correct model outputs prove that semantic meaning rather than surface patterns drove the response?
- How do chain-of-thought structures affect reasoning robustness?
- What makes counterfactual thinking different from behavioral pattern matching?
- What distinguishes instance seeds from full input-output exemplar requirements?
- Why does step-by-step reasoning degrade performance on judgment-based tasks?
- What saliency patterns distinguish successful from failed chain-of-thought reasoning?
- Can structured output formats reduce instruction following degradation?
- Can correct outputs mask reliance on surface heuristics rather than deep understanding?
- What makes clinical theory grounding more effective than pattern matching alone?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- How can entailment benchmarks separate genuine reasoning from memorization effects?
- How can structurally different text produce equivalent real-world effects?
- How does evaluation format change what we measure about model reasoning?
- What are the seven components of genuine mental state simulation?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Can chain of thought reasoning actually validate logical arguments?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- What are collider structures and why do they reveal reasoning errors?
- Where do collider-type reasoning errors appear in real-world decisions?
- Can chain-of-thought reasoning be genuinely causal if exemplars don't need logic?
- Which structural properties of CoT prompts matter most for performance?
- How does training format shape reasoning strategy more than content?
- Does logical trace coherence guarantee valid mathematical reasoning?
- Does chain-of-thought reasoning amplify bullshit or just make it more visible?
- Can synthesized explanations be more auditable than winning-chain explanations?
- Can high test performance mask a complete absence of understanding?
- Why does reasoning effort fail to improve theory of mind performance?
- What makes accurate confidence different from confident-but-wrong predictions?
- How much does training composition affect syntactic versus reasoning performance?
- How much does omniscient evaluation overstate real-world simulation fidelity?
- Why does extended thinking increase output variance without improving reasoning quality?
- What consistency tests could distinguish constructed from genuine preferences?
- How much does training data presentation format shape reasoning ability?
- What makes structural logic correlate so strongly with contextual consistency?
- How does model confidence relate to exemplar brittleness in chain-of-thought?
- What makes tarot and periodic tables resist meaningful scientific integration?
- Why does the gap between theoretical expressiveness and learned capability matter?
- How do exemplar properties affect the brittleness of chain-of-thought prompting?
- Can functional behavior alone capture what makes something a genuine belief?
- What attention mechanisms explain why verification steps get ignored?
- How do we verify that stated beliefs actually follow from underlying motifs?
- What distinguishes inductive inference from negative evidence versus positive patterns?
- What explains the gap between perplexity performance and actual reasoning capability?
- Can users reliably distinguish valid reasoning from plausible-looking deception?
- Why do invalid reasoning steps produce nearly the same performance gains?
- What three factors actually drive chain of thought performance improvements?
- How can we measure whether process rewards actually align with reasoning quality?
- Why do format and structure matter more than actual content in reasoning?
- Why do we measure reasoning quality by reading visible chains?
- Why do invalid prompts produce reasoning traces as effectively as valid ones?
- Can training improve reasoning coherence without improving actual correctness?
- Do current math benchmarks measure outcomes or rhetorical plausibility?
- How do partial credit grading systems accidentally reward reasoning theater?
- Why do chain-of-thought outputs look logical but perform rhetorically?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- Why does long CoT training optimize for structural coherence over content correctness?
- When does the correlation between consistency and correctness break down?
- Can multiple verification approaches together overcome the self-improvement ceiling?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- Does meta-judging improve evaluator quality better than temporal decoupling alone?
- Does SFT degrade reasoning quality while improving domain accuracy?
- What cognitive structures do realistic belief models need to include?
- Does explicit reasoning help or hurt tasks requiring continuous judgment?
- Can reasoning evaluation metrics reward actual reasoning instead of theater?
- Can structured decomposition fix evaluation gaps in other research tasks?
- How do output format constraints compare to input exemplar brittleness?
- Why do instruction following and reasoning capability trade off in training?
- Why does premise ordering shift syllogistic reasoning performance by over 30 percent?
- Why does augmenting symbolic reasoning outperform replacing it entirely?
- When should verification steps be prioritized over progression steps?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- How does chain of thought amplify specific forms of rhetorical bullshit?
- Why do benchmark scores rise while reasoning quality declines?
- Does the thinking box provide genuine reasoning or just token budget?
- Why does sophisticated measurement not validate the underlying scientific inference?
- How do structured benchmarks hide theory of mind failures in LLMs?
- Why does additional reasoning effort not improve theory of mind performance?
- How do satisfaction scores differ from genuine cognitive improvement?
- Does the answer stage perform substantial reasoning beyond the thinking draft?
- Can structured prompts reduce reasoning steps while improving financial accuracy?
- Why does training data format shape reasoning strategy more than content?
- How does training on correct answer form differ mechanistically from training on failure analysis?
- What distinguishes intrinsic metacognition from extrinsic human-designed loops?
- Why does contextual judgment matter more in law and medicine than in mathematics?
- How do expert communities develop and enforce standards for valid arguments?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- What makes mathematically confident but incorrect answers resemble valid solution shapes?
- Can simple structure perturbations reliably expose memorization in reasoning models?
- What makes well-formatted outputs misleading as evidence of model capability?
- What structural differences emerge between early generic skills and later meta-strategy skills?
- Does the verification gap widen exactly where judgment replaces checkability?
- Can verification loops and decomposition fix judgment failures?
- How should research governance adapt to structural verification delays?
- Does belief-shift credit assignment generalize to tasks without ground-truth outcomes?
- Why do explicit quality criteria outperform learning quality from examples alone?
- Can skill validation through testing prevent unreliable programs from accumulating?
- Can a perfect behavioral simulation constitute genuine understanding or experience?
- Why does the Chinese Room argument miss the deeper abstraction problem?
- How does vehicle causality differ from content causality in physical systems?
- Why do humans trust explanations that fail counterfactual prediction tests?
- Does optimizing against CoT monitors inevitably produce obfuscated reasoning?
- Can you steer reasoning by directly manipulating SAE features?
- Why do semi-formal templates improve verification accuracy over unstructured reasoning?
- Can structured reasoning replace execution for runtime behavior verification?
- How does faithfulness differ from informativeness in chain-of-thought evaluation?
- How does test-time verification decouple the act of checking from reasoning generation?
- Can thought quality alone be trusted to guide model training?
- Do synthetic verification chains from long-CoT models match the quality of human-annotated process labels?
- What makes a deployment paradigm credible for maintaining scientific integrity?
- How should process quality and verification cost factor into evaluation judgment?
- What distinguishes genuine capability gains from coherent but invalid reasoning traces?
- Why does showing counterarguments restore users' ability to discriminate?
- Does performative reasoning mask underlying uncertainty even on easy problems?
- How should tool-call attribution distinguish credit between successful accidents and intentional actions?
- Can chain of thought monitoring reliably catch model misbehavior?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- What evaluation methods actually measure reasoning versus execution capability?
- How does RPT compare to learning when versus how to deploy reasoning?
- How can benchmark accuracy scores mask the absence of interpretable reasoning structure?
- How can structured reasoning templates serve as rewards for code agent training?
- How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
- What distinguishes metacognitive regulation from standard chain-of-thought reasoning?
- Why does target probability matter more than task logical complexity?
- Can held-out validation gates prevent optimizer hallucinations in skill proposals?
- Does CoT reasoning actually cause the outputs that follow it?
- Can formal argumentation structure replace ad-hoc fallacy classifications?
- What happens when models optimize specifically against CoT monitors?
- How do thought actions represent policy improvement steps in practice?
- What role does task structure play in rewarding delayed thinking?
- Can experimental outcomes be reliably distilled into reusable insights?
- How do local soundness signals work across different problem domains?
- Why does exemplar performance vary across order complexity diversity and style?
- How does externalizing tacit expertise into structured rules differ from prompt engineering?
- How do mechanistic interpretability tools help distinguish truthfulness from honesty?
- How do live human evaluations differ from ground-truth benchmarks?
- How brittle are chain-of-thought exemplars across order and complexity?
- What types of math proofs benefit most from proof-by-contradiction framing?
- How does structured environment state compare to transcript replay for multi-turn reasoning?
Related concepts in this collection (6)
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  invalid exemplars still working confirms form-over-content thesis
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  bidirectional unfaithfulness: exemplar validity and output validity both decorative
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  the dimensions that matter are structural, not logical
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  same source batch: if reasoning is semantic not symbolic, logical validity of exemplars is irrelevant
- Do reasoning traces need to be semantically correct?
  Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
  convergent finding from training rather than prompting: invalid exemplars (this note) and corrupted training traces (that note) both preserve performance, confirming that logical content is dispensable and structure/scaffolding is the active ingredient
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  the structural explanation for why invalid logic still works: CoT gains come from structural coherence (step decomposition, scaffolding) not content correctness, so logically invalid exemplars provide the same structural benefits
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
- Measuring Faithfulness in Chain-of-Thought Reasoning
- CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective
Original note title
logically invalid cot prompts perform nearly as well as valid ones — valid reasoning is not the chief driver of cot gains