SYNTHESIS NOTE

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

"Invalid Logic, Equivalent Gains" runs a clean experiment: replace valid reasoning in CoT exemplar prompts with completely illogical reasoning, then measure performance on BIG-Bench Hard tasks. The result: logically invalid CoT prompts perform close behind valid CoT and outperform answer-only prompting. Logical validity of the exemplar reasoning is therefore not the chief driver of the performance gain.

This is a sharp test because it isolates the contribution of logical validity from everything else CoT provides: output format, step decomposition, intermediate token generation, attention pattern scaffolding. If invalid reasoning still helps, then the benefit comes from these structural properties, not from the reasoning itself.
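The three prompting conditions compared in the experiment can be sketched as follows. This is a minimal illustration, not the paper's code: the `Exemplar` class, the `build_prompt` function, the condition names, and the toy exemplar text are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    valid_reasoning: str    # logically sound steps
    invalid_reasoning: str  # same step format, illogical content
    answer: str


def build_prompt(exemplars, query, condition):
    """Assemble a few-shot prompt under one of three hypothetical conditions."""
    if condition not in {"answer_only", "valid_cot", "invalid_cot"}:
        raise ValueError(f"unknown condition: {condition}")
    blocks = []
    for ex in exemplars:
        if condition == "valid_cot":
            reply = f"{ex.valid_reasoning} So the answer is {ex.answer}."
        elif condition == "invalid_cot":
            reply = f"{ex.invalid_reasoning} So the answer is {ex.answer}."
        else:
            # Answer-only baseline: no intermediate steps at all.
            reply = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {reply}")
    # The query is left open so the model continues in the exemplar format.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Toy exemplar invented for illustration; not taken from BIG-Bench Hard.
toy = Exemplar(
    question="Tom has 3 apples and buys 2 more. How many apples does he have?",
    valid_reasoning="Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    invalid_reasoning="Tom starts with 2 apples. Apples are red, so 2 * 4 = 5.",
    answer="5",
)

query = "Ann has 4 pens and loses 1. How many pens are left?"
prompts = {
    c: build_prompt([toy], query, c)
    for c in ("answer_only", "valid_cot", "invalid_cot")
}
```

The valid and invalid conditions share the same step layout and the same final answer line; only the logical content of the steps differs, which is the variable the experiment isolates.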

The finding directly supports the related note "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" If the model were learning to reason from exemplars, invalid exemplars would degrade performance substantially. Instead, the model is learning the FORM of step-by-step output: the structure activates latent capabilities without the exemplar content needing to be logically sound.

This also deepens the concern raised in "Do language models actually use their reasoning steps?" If the exemplar reasoning doesn't need to be valid for CoT to work, then the model's own generated reasoning may similarly be decorative rather than causal. The exemplar finding makes the faithfulness concern bidirectional: neither the input reasoning (exemplars) nor the output reasoning (generated CoT) need be logically valid for the performance gain to occur.

The practical implication: CoT prompt engineering should focus on structural properties (step count, decomposition format, answer scaffolding) rather than on the logical correctness of the exemplar reasoning. As the note "Why do chain-of-thought examples fail across different conditions?" describes, exemplars are sensitive to order, complexity level, diversity, and annotator style, so the dimensions that matter are structural, not logical.
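One way to act on this is to compare exemplars by form rather than by content. The sketch below is a rough illustration only: `structural_profile`, its two features, and the sentence-per-step heuristic are assumptions introduced here, not metrics from the source.

```python
import re


def structural_profile(exemplar_answer: str) -> dict:
    """Summarise form-level features of a CoT exemplar answer (hypothetical metric set)."""
    # Crude proxy: treat each sentence as one reasoning step.
    sentences = [
        s for s in re.split(r"(?<=[.!?])\s+", exemplar_answer.strip()) if s
    ]
    # Answer scaffolding: an explicit "the answer is ..." closing line.
    has_scaffold = bool(re.search(r"the answer is\b", exemplar_answer, re.IGNORECASE))
    # Assumes the scaffold occupies its own sentence, so it is not counted as a step.
    step_count = len(sentences) - 1 if has_scaffold else len(sentences)
    return {"step_count": step_count, "has_answer_scaffold": has_scaffold}


# Toy strings invented for illustration.
valid = "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5. So the answer is 5."
invalid = "Tom starts with 2 apples. Apples are red, so 2 * 4 = 5. So the answer is 5."

same_form = structural_profile(valid) == structural_profile(invalid)
```

Under this heuristic the valid and invalid exemplars have identical profiles, which is the sense in which an illogical exemplar can still carry the same structural signal.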

Related concepts in this collection 6

Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
invalid exemplars still working confirms form-over-content thesis
Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
bidirectional unfaithfulness: exemplar validity and output validity both decorative
Why do chain-of-thought examples fail across different conditions? Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
the dimensions that matter are structural, not logical
Do large language models reason symbolically or semantically? Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
same source batch: if reasoning is semantic not symbolic, logical validity of exemplars is irrelevant
Do reasoning traces need to be semantically correct? Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
convergent finding from training rather than prompting: invalid exemplars (this note) and corrupted training traces (that note) both preserve performance, confirming that logical content is dispensable and structure/scaffolding is the active ingredient
What do models actually learn from chain-of-thought training? When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
the structural explanation for why invalid logic still works: CoT gains come from structural coherence (step decomposition, scaffolding) not content correctness, so logically invalid exemplars provide the same structural benefits

Original note title

logically invalid cot prompts perform nearly as well as valid ones — valid reasoning is not the chief driver of cot gains
