What three factors actually drive chain-of-thought performance improvements?
This explores what actually produces the accuracy gains we attribute to chain-of-thought (CoT) reasoning, and whether the model is genuinely reasoning or doing something else dressed up as reasoning. The most direct answer in the collection comes from a shift-cipher study that separates CoT performance into three factors: raw output probability, memorization, and a thin layer of genuine step-by-step reasoning that degrades as the chain grows (see "What three separate factors drive chain-of-thought performance?"). Output probability alone swings accuracy from 26% to 70%: the model leans toward answers that were simply more likely in pre-training. Memorization tracks how often a pattern appeared in the training data. The reasoning component is real but fragile, because error accumulates with every additional step. So the 'reason or memorize' debate dissolves: LLMs do both at once.
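The fragility of the step-by-step component can be made concrete with a minimal sketch. It pairs a plain shift-cipher decoder with a toy compounding-error model. The function names, the per-step accuracy of 0.98, and the assumption that steps fail independently are illustrative choices, not details taken from the study.

```python
import string

ALPHABET = string.ascii_lowercase


def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift cipher by moving each lowercase letter back `shift` places."""
    decoded = []
    for ch in ciphertext:
        if ch in ALPHABET:
            decoded.append(ALPHABET[(ALPHABET.index(ch) - shift) % 26])
        else:
            decoded.append(ch)  # leave spaces and punctuation untouched
    return "".join(decoded)


def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Toy model (assumption): the chain is right only if every step is right,
    and steps fail independently of one another."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    print(shift_decode("uryyb", 13))  # "uryyb" is "hello" shifted by 13
    for steps in (1, 5, 13, 25):
        print(steps, round(chain_success_probability(0.98, steps), 3))
```

Under these assumptions even a high per-step accuracy erodes steadily as the chain lengthens, which is the qualitative pattern the note describes. The real error dynamics in a model need not be independent or uniform across steps.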
What makes this genuinely unsettling is how much of the apparent reasoning turns out to be *form rather than content*. Logically invalid CoT exemplars perform nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and across studies the structural and spatial properties of a prompt shape outcomes far more than its logical correctness: training format influences reasoning strategy 7.5× more than the problem domain, and demo position alone can swing accuracy by 20% (see "What makes chain-of-thought reasoning actually work?"). The synthesizing view is that CoT is *constrained imitation*: pattern-matched reproduction of what reasoning looks like, not genuine inference (see "What makes chain-of-thought reasoning actually work?"). This reframes the whole question: on this view, the third 'factor' driving performance isn't a reasoning skill but the model's fluency at mimicking a reasoning-shaped surface.
The collection also pushes back on the intuition that *more* reasoning means *better* reasoning. Accuracy follows an inverted U against chain length: it peaks at an intermediate length and then declines, with the optimal length rising for harder tasks but shrinking as the model gets more capable (see "Why does chain of thought accuracy eventually decline with length?"). Push thinking tokens from ~1,100 to ~16K and benchmark accuracy can fall from 87% to 70% as the model overthinks easy problems (see "Does more thinking time always improve reasoning accuracy?"). Longer traces don't even signal harder problems: trace length mostly reflects how close a problem sits to the training distribution, not how much computation it actually needs (see "Does longer reasoning actually mean harder problems?").
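One way to see how an inverted U could arise is a toy model with two opposing effects: extra steps make progress on the problem, but every step also carries a chance of error. The functional forms, parameter names, and default values below are purely illustrative assumptions and are not taken from any of the cited studies.

```python
import math


def toy_accuracy(length: int, difficulty: float, capability: float,
                 per_step_error: float = 0.02) -> float:
    """Toy accuracy curve (assumed form, not from the source).

    progress:    saturating benefit of more steps, faster for capable models
                 and slower for difficult tasks.
    reliability: chance that none of the steps introduced an error.
    """
    progress = 1.0 - math.exp(-capability * length / difficulty)
    reliability = (1.0 - per_step_error) ** length
    return progress * reliability


def best_length(difficulty: float, capability: float, max_length: int = 200) -> int:
    """Chain length (in steps) that maximizes the toy accuracy curve."""
    return max(range(1, max_length + 1),
               key=lambda n: toy_accuracy(n, difficulty, capability))


if __name__ == "__main__":
    for difficulty, capability in [(10, 1.0), (40, 1.0), (10, 2.0)]:
        n = best_length(difficulty, capability)
        print(f"difficulty={difficulty} capability={capability} best_length={n}")
```

In this sketch the best length grows with `difficulty` and shrinks with `capability`, matching the direction of the reported trend. It says nothing about the actual magnitudes, which depend on the model and benchmark.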
There's a real counterweight, though, so this isn't pure debunking. On genuinely compositional tasks such as graph connectivity, where you *must* accumulate intermediate results in order, sequential CoT delivers an exponential advantage over parallel voting (see "When does sequential reasoning beat parallel voting?"). When the problem structurally demands step-by-step accumulation, the reasoning factor earns its keep. Whether thinking helps at all also depends on training: vanilla models can use extended thinking counterproductively, spiraling into self-doubt, while RL training redirects the same mechanism into productive analysis (see "Does extended thinking help or hurt model reasoning?").
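To illustrate why graph connectivity is inherently sequential, here is a minimal breadth-first search sketch. It is a standard textbook algorithm, not the procedure used in the cited study, and the function and variable names are chosen for illustration. The point is that each expansion depends on the `visited` set built up by all earlier expansions.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple


def connected(edges: Iterable[Tuple[Hashable, Hashable]],
              source: Hashable, target: Hashable) -> bool:
    """Return True if `target` is reachable from `source` in an undirected graph."""
    neighbors: Dict[Hashable, Set[Hashable]] = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    visited = {source}           # intermediate result carried between steps
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if node == target:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in visited:   # depends on everything found so far
                visited.add(nxt)
                frontier.append(nxt)
    return False


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
    print(connected(edges, 1, 4))
    print(connected(edges, 1, 6))
```

A set of short, independent attempts cannot share the `visited` state, which is one intuition for why voting over parallel chains struggles on tasks of this shape.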
If you want to follow the thread further, the faithfulness work is where it gets pointed: fine-tuning can quietly weaken the causal link between the reasoning steps and the final answer, so the chain becomes performative. It reads like an explanation but no longer reliably drives the output (see "Does fine-tuning disconnect reasoning steps from final answers?"). Put together, the corpus suggests the honest answer to 'what drives CoT gains' is: answer probability, memorized patterns, and a genuine-but-brittle reasoning component, with the last one mattering most precisely when the task can't be faked by the first two.
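The logic of such faithfulness tests can be sketched as a small harness: perturb the reasoning chain, ask for the answer again, and count how often it stays the same. Everything here is hypothetical scaffolding. `AnswerFn` stands in for a real model call, the two stub "models" exist only to show the extremes, and paraphrasing is omitted because it would need a separate paraphraser.

```python
from typing import Callable, List, Sequence, Tuple

Chain = List[str]
AnswerFn = Callable[[str, Chain], str]  # hypothetical: (question, chain) -> answer


def early_termination(chain: Chain) -> Chain:
    """Keep only the first half of the reasoning steps."""
    return chain[: len(chain) // 2]


def filler_substitution(chain: Chain) -> Chain:
    """Replace every step with content-free filler, keeping the step count."""
    return ["..." for _ in chain]


def invariance_rate(answer_fn: AnswerFn,
                    examples: Sequence[Tuple[str, Chain]],
                    perturb: Callable[[Chain], Chain]) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.
    A high rate suggests the chain is not driving the answer."""
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(
        answer_fn(question, chain) == answer_fn(question, perturb(chain))
        for question, chain in examples
    )
    return unchanged / len(examples)


def chain_reader(question: str, chain: Chain) -> str:
    """Stub model whose answer is simply the last reasoning step."""
    return chain[-1] if chain else "unknown"


def chain_ignorer(question: str, chain: Chain) -> str:
    """Stub model that ignores the chain entirely."""
    return question.upper()


EXAMPLES = [
    ("what is 2 + 3?", ["start with 2", "add 3", "5"]),
    ("what is 4 * 2?", ["start with 4", "double it", "8"]),
]

if __name__ == "__main__":
    for name, fn in [("chain_reader", chain_reader), ("chain_ignorer", chain_ignorer)]:
        for perturb in (early_termination, filler_substitution):
            rate = invariance_rate(fn, EXAMPLES, perturb)
            print(name, perturb.__name__, rate)
```

The stub that reads its chain changes its answer under every perturbation, while the stub that ignores the chain never does. Real models fall somewhere in between, and the reported finding is that fine-tuning moves them toward the invariant end.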
Sources (10 notes)
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.