Why do logically invalid chain-of-thought examples work nearly as well?

This explores why chain-of-thought prompts with broken logic still boost performance almost as much as correct ones — and what that reveals about whether CoT is reasoning at all.

The short answer the corpus keeps circling back to: chain-of-thought works by showing the model the *shape* of reasoning, not the *substance*. When researchers fed models deliberately illogical CoT exemplars on BIG-Bench Hard, performance barely dropped, because what the model picks up is the form (lay out steps, march toward an answer), not the validity of the inference inside those steps (see: Does logical validity actually drive chain-of-thought gains?). The gains come from structural cues, not logical correctness.
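To make "same shape, different substance" concrete, here is a minimal sketch of how a valid and a logically invalid exemplar can share one layout. The `Exemplar` class, the `render` and `build_prompt` helpers, and both example problems are hypothetical, written for this illustration rather than taken from any benchmark or from the cited study. Keeping the final answer correct in the invalid exemplar is also an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    question: str
    steps: tuple[str, ...]  # intermediate reasoning lines
    answer: str


def render(ex: Exemplar) -> str:
    """Format one exemplar as question, numbered steps, final answer."""
    lines = [f"Q: {ex.question}"]
    lines += [f"Step {i}: {s}" for i, s in enumerate(ex.steps, start=1)]
    lines.append(f"A: {ex.answer}")
    return "\n".join(lines)


def build_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Few-shot prompt: rendered exemplars followed by the open question."""
    shots = "\n\n".join(render(ex) for ex in exemplars)
    return f"{shots}\n\nQ: {question}\nStep 1:"


# Hypothetical exemplars invented for this sketch.
valid = Exemplar(
    question="Tom has 3 boxes with 4 pens each. How many pens?",
    steps=("Each box holds 4 pens.", "3 boxes times 4 pens is 12 pens."),
    answer="12",
)
# Same layout and final answer, but the steps do not follow from the question.
invalid = Exemplar(
    question=valid.question,
    steps=("Each box holds 3 pens.", "4 pens plus 4 boxes is 12 pens."),
    answer="12",
)

test_question = "A shelf has 5 rows of 6 books. How many books?"
prompt_valid = build_prompt([valid], test_question)
prompt_invalid = build_prompt([invalid], test_question)
```

Both prompts have the same number of lines and the same step markers; only the content of the steps differs, which is the variable the illogical-exemplar comparison isolates.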

The deeper framing across the collection is that CoT is *constrained imitation*: the model reproduces familiar reasoning patterns it saw in training rather than performing genuine symbolic inference (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Why does chain-of-thought reasoning fail in predictable ways?). This reframes the whole question: if CoT were real deductive machinery, scrambling the logic would wreck it. Instead, format dominates content. One striking measurement: training *format* shapes a model's reasoning strategy 7.5× more than the *domain* of the problem, and where you place a demonstration can swing accuracy by 20% (see: What makes chain-of-thought reasoning actually work?). The 'reasoning' is pattern-guided generation wearing the costume of logic.

The same pattern shows up at training time, not just in prompts. When researchers replaced the reasoning *traces* models train on with systematically irrelevant ones, accuracy held and out-of-distribution generalization sometimes even improved, suggesting traces act as computational scaffolding (room to compute) rather than meaningful steps (see: Do reasoning traces need to be semantically correct?). A related study of R1 found its intermediate tokens carry no special execution semantics; invalid traces frequently produce correct answers, so the trace correlates with the answer through learned formatting, not functional reasoning (see: Do reasoning traces actually cause correct answers?). Faithfulness work pushes this further: reasoning chains fail tests of both causal sufficiency and causal necessity, meaning the steps often don't actually drive the answer (see: Do language models actually use their reasoning steps?).
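One way to picture a necessity check is step ablation: drop each step in turn and see whether the final answer moves. The sketch below is one possible operationalization under stated assumptions, not the procedure used in the cited faithfulness work. `answer_fn` is a hypothetical stand-in for "run the model on this trace and read off its answer", and the two toy functions exist only to show the two extremes.

```python
from typing import Callable, List, Sequence


def necessary_steps(
    steps: Sequence[str],
    answer_fn: Callable[[Sequence[str]], str],
) -> List[int]:
    """Return indices of steps whose removal changes the final answer."""
    baseline = answer_fn(steps)
    changed = []
    for i in range(len(steps)):
        ablated = [s for j, s in enumerate(steps) if j != i]
        if answer_fn(ablated) != baseline:
            changed.append(i)
    return changed


# Toy stand-ins for a model (hypothetical, for illustration only).
def ignores_trace(steps: Sequence[str]) -> str:
    """Answer is fixed regardless of the trace: no step is necessary."""
    return "12"


def reads_last_step(steps: Sequence[str]) -> str:
    """Answer is the last token of the last step: only that step matters."""
    return steps[-1].split()[-1] if steps else "?"


trace = ["3 boxes of 4 pens", "3 * 4 = 12"]
unused = necessary_steps(trace, ignores_trace)
used = necessary_steps(trace, reads_last_step)
```

A trace whose ablation list comes back empty is decoration with respect to this test, however plausible the steps look on the page.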

Here's the twist that keeps this from being purely deflationary. CoT isn't *always* theater. Activation probes show that on easy problems models commit to an answer internally long before they finish writing, which makes the written reasoning performative, but on genuinely hard problems the written reasoning tracks real internal belief updates, with detectable inflection points (see: Does chain-of-thought reasoning reflect genuine thinking or performance?). So invalid CoT works 'nearly as well' precisely on the tasks where the reasoning was never load-bearing to begin with. Where it bites: reasoning models actually *underperform* non-reasoning ones on exception-based rule inference (below 25% versus 55–65% across four game-based tasks), because forced step-by-step structure injects math overuse, overgeneralization, and hallucinated constraints (see: Why do reasoning models fail at exception-based rule inference?).
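The early-commitment finding suggests a simple stopping rule: if a probe says the model has already settled on an answer, stop generating. The sketch below assumes some probe already yields one confidence value per reasoning step. The probe itself, the `threshold` and `patience` parameters, and the confidence sequences are all hypothetical, invented to show the control flow rather than to reproduce the cited method.

```python
from __future__ import annotations

from typing import Sequence


def early_exit_step(
    confidences: Sequence[float],
    threshold: float = 0.9,
    patience: int = 2,
) -> int | None:
    """Index of the first step at which confidence has stayed at or above
    `threshold` for `patience` consecutive steps, or None if it never does."""
    run = 0
    for i, c in enumerate(confidences):
        run = run + 1 if c >= threshold else 0
        if run >= patience:
            return i
    return None


# Synthetic probe readings, not measurements.
easy_problem = [0.95, 0.97, 0.98, 0.99, 0.99, 0.99]  # committed from the start
hard_problem = [0.40, 0.50, 0.45, 0.92, 0.96, 0.97]  # belief shifts mid-trace

easy_exit = early_exit_step(easy_problem)
hard_exit = early_exit_step(hard_problem)
```

On the easy sequence the rule fires almost immediately; on the hard one it waits until after the inflection, which matches the intuition that the written steps only carry weight in the second case.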

The thing you might not have known you wanted to know: more reasoning isn't automatically better, and training dynamics reflect it. Optimal CoT length follows an inverted U: accuracy peaks at an intermediate length and *declines* as chains get longer. The optimal length grows with task difficulty but shrinks with model capability, and RL training naturally drifts toward shorter chains as a model improves (see: Why does chain of thought accuracy eventually decline with length?). Even *whether* to use CoT depends on the question: for simple ones, a direct question-to-answer path beats step-by-step, and CoT only helps when the question's information aggregates into the prompt before reasoning starts (see: Why do some questions perform better without step-by-step reasoning?). Invalid logic surviving is the visible symptom of a quieter truth: the scaffolding, not the deduction, is doing most of the work.
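If you have accuracy measured at several chain lengths, checking for the inverted U is a few lines. This is a minimal sketch: `best_length` and `is_inverted_u` are hypothetical helper names, and the accuracy table is synthetic, chosen only to show the shape, not taken from any study.

```python
from __future__ import annotations


def best_length(acc_by_length: dict[int, float]) -> int:
    """Chain length with the highest accuracy; ties go to the shorter chain."""
    return max(acc_by_length, key=lambda n: (acc_by_length[n], -n))


def is_inverted_u(acc_by_length: dict[int, float]) -> bool:
    """True if accuracy rises to an interior peak and falls afterwards."""
    lengths = sorted(acc_by_length)
    accs = [acc_by_length[n] for n in lengths]
    peak = lengths.index(best_length(acc_by_length))
    rising = all(a <= b for a, b in zip(accs[:peak], accs[1 : peak + 1]))
    falling = all(a >= b for a, b in zip(accs[peak:], accs[peak + 1 :]))
    return 0 < peak < len(lengths) - 1 and rising and falling


# Synthetic accuracies keyed by number of reasoning steps (illustrative only).
curve = {2: 0.55, 4: 0.70, 8: 0.78, 16: 0.72, 32: 0.60}
monotone = {2: 0.50, 4: 0.60, 8: 0.70}

peak_length = best_length(curve)
curve_is_u = is_inverted_u(curve)
monotone_is_u = is_inverted_u(monotone)
```

Breaking ties toward the shorter chain is a design choice in this sketch; it mirrors the observation that brevity is preferable when accuracy is equal.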

Sources (11 notes)

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

Does chain-of-thought reasoning reflect genuine thinking or performance?

Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.

Why do reasoning models fail at exception-based rule inference?

Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
