INQUIRING LINE

What three factors actually drive chain of thought performance improvements?

This explores what actually produces the accuracy gains we attribute to chain-of-thought reasoning — and the corpus points to a surprising answer: the visible 'reasoning' is often not where the gains come from.


This explores what actually produces the accuracy gains we attribute to chain-of-thought (CoT) reasoning, and whether the model is genuinely reasoning or doing something else dressed up as reasoning. The most direct answer in the collection comes from a shift-cipher study that separates CoT performance into three factors: raw output probability, memorization, and a thin layer of genuine step-by-step reasoning that degrades as the chain grows [What three separate factors drive chain-of-thought performance?]. Output probability alone swings accuracy from 26% to 70%: the model leans toward answers that were simply more likely in pre-training. Memorization tracks how often a pattern appeared in training data. And the actual reasoning component is real but fragile: error accumulates with every additional step. So the 'reason or memorize' debate dissolves, because LLMs do both at once.
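
To make the task concrete: a shift cipher replaces each letter with the one a fixed number of positions later in the alphabet (rot-13 is the familiar case), so decoding is naturally a letter-by-letter chain. The sketch below is only an illustration of that task and of how per-step error compounds. The function names are hypothetical, and the assumption that every step fails independently with the same probability is a simplification of mine, not a result from the study.

```python
def shift_letter(ch: str, shift: int) -> str:
    """Shift one ASCII letter by `shift` positions, wrapping around the alphabet."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # leave spaces, digits, and punctuation untouched
    base = ord("A") if ch.isupper() else ord("a")
    return chr((ord(ch) - base + shift) % 26 + base)


def shift_encode(text: str, shift: int) -> str:
    """Encode a whole string with a shift cipher."""
    return "".join(shift_letter(ch, shift) for ch in text)


def shift_decode_steps(ciphertext: str, shift: int) -> list[str]:
    """Decode one letter at a time, returning the partial plaintext after each step."""
    partial, steps = "", []
    for ch in ciphertext:
        partial += shift_letter(ch, -shift)
        steps.append(partial)
    return steps


def chain_success_probability(per_step_accuracy: float, n_steps: int) -> float:
    """ASSUMPTION: every step succeeds independently with the same probability."""
    return per_step_accuracy ** n_steps


cipher = shift_encode("stay", 13)
print(cipher)                          # fgnl
print(shift_decode_steps(cipher, 13))  # ['s', 'st', 'sta', 'stay']
print(round(chain_success_probability(0.95, 10), 3))  # 0.599
```

Under that simplified assumption, a chain that is 95% reliable per step gets all ten steps right only about 60% of the time, which is the sense in which the reasoning component is real but fragile.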

What makes this genuinely unsettling is how much of the apparent reasoning turns out to be *form rather than content*. Logically invalid CoT exemplars perform nearly as well as valid ones [Does logical validity actually drive chain-of-thought gains?], and across studies the structural and spatial properties of a prompt shape outcomes far more than its logical correctness: training format influences reasoning strategy 7.5× more than the problem domain, and demo position alone can swing accuracy by 20% [What makes chain-of-thought reasoning actually work?]. The synthesizing view is that CoT is *constrained imitation*: pattern-matched reproduction of what reasoning looks like, not genuine inference [What makes chain-of-thought reasoning actually work?]. On this view, much of what looks like the third 'factor' is less a reasoning skill than the model's fluency at mimicking a reasoning-shaped surface.

The collection also pushes back on the intuition that *more* reasoning means *better* reasoning. Accuracy follows an inverted-U against chain length: it peaks at an intermediate length and then declines, with the optimal length rising for harder tasks but shrinking as the model gets more capable [Why does chain of thought accuracy eventually decline with length?]. Push thinking tokens from ~1,100 to ~16K and benchmark accuracy can fall from 87.3% to 70.3% as the model overthinks easy problems [Does more thinking time always improve reasoning accuracy?]. And longer traces don't even signal harder problems: trace length mostly reflects how close a problem sits to the training distribution, not how much computation it actually needs [Does longer reasoning actually mean harder problems?].
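
One way to see why an inverted-U is plausible is a toy model in which splitting a task into more steps makes each step easier, while every extra step adds its own chance of a slip. Everything in the sketch below is an illustrative assumption of mine (the functional form, the parameter names `difficulty`, `capability`, and `step_noise`, and the default values); it is not the model or the data from the cited notes, only a way to reproduce the qualitative shape.

```python
import math


def toy_accuracy(n_steps: int, difficulty: float, capability: float,
                 step_noise: float = 0.02) -> float:
    """Toy accuracy of a chain that spreads `difficulty` units of work over n_steps.

    ASSUMPTIONS (illustrative only):
    - a step's success falls off with the square of its load (difficulty / n_steps),
      and a more capable model tolerates more load per step;
    - each step independently slips with probability `step_noise`.
    """
    load = difficulty / n_steps
    per_step = math.exp(-(load ** 2) / capability) * (1.0 - step_noise)
    return per_step ** n_steps


def best_length(difficulty: float, capability: float, max_steps: int = 200) -> int:
    """Chain length (in steps) that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: toy_accuracy(n, difficulty, capability))


for difficulty, capability in [(4, 1), (8, 1), (4, 4)]:
    n = best_length(difficulty, capability)
    print(f"difficulty={difficulty} capability={capability} best_length={n}")
```

In this toy setting, accuracy is low for very short chains (each step is overloaded) and for very long ones (slips pile up), the best length grows with difficulty, and it shrinks as capability rises, matching the direction of the findings described above without claiming their magnitudes.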

There's a real counterweight, though, so this isn't pure debunking. On genuinely compositional tasks such as graph connectivity, where you *must* accumulate intermediate results in order, sequential CoT delivers an exponential advantage over parallel voting [When does sequential reasoning beat parallel voting?]. So when the problem structurally demands step-by-step accumulation, the reasoning factor earns its keep. And whether thinking helps at all depends on training: vanilla models can use extended thinking counterproductively, spiraling into self-doubt, while RL training redirects the same mechanism into productive analysis [Does extended thinking help or hurt model reasoning?].
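
Graph connectivity is a useful mental model for "must accumulate intermediate results in order". The sketch below is a standard breadth-first search, not the experimental setup from the cited note, and the function name `is_connected` is my own. It shows that each expansion depends on the visited set built by all earlier expansions, which is the kind of dependency a set of short independent attempts cannot shortcut.

```python
from collections import deque


def is_connected(edges: list[tuple[int, int]], source: int, target: int) -> bool:
    """Return True if `target` is reachable from `source` in an undirected graph."""
    # Build an adjacency map from the edge list.
    adjacency: dict[int, set[int]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    visited = {source}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if node == target:
            return True
        # Each step extends the result of every previous step:
        # the frontier only makes sense given what is already visited.
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return False


edges = [(1, 2), (2, 3), (4, 5)]
print(is_connected(edges, 1, 3))  # True
print(is_connected(edges, 1, 5))  # False
```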

If you want to follow the thread further, the faithfulness work is where it gets pointed. Fine-tuning can quietly weaken the causal link between the reasoning steps and the final answer, so the chain becomes performative: it reads like an explanation but less reliably drives the output [Does fine-tuning disconnect reasoning steps from final answers?]. Put together, the corpus suggests the honest answer to 'what drives CoT gains' is answer probability, memorized patterns, and a genuine-but-brittle reasoning component, and that the last one matters most precisely when the task can't be faked by the first two.
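
The faithfulness idea can be sketched as a small harness: perturb the reasoning chain, re-ask for the answer, and count how often the answer stays the same. A high invariance rate suggests the chain is not doing causal work. All names here are hypothetical, and the two "models" are stand-in stubs rather than real LLM calls; the paraphrasing test is omitted because it would need a real model to rewrite the chain.

```python
from typing import Callable

# Hypothetical interface: (question, reasoning_chain) -> final answer.
AnswerFn = Callable[[str, str], str]


def truncate_chain(chain: str, keep_fraction: float) -> str:
    """Early termination: keep only the first part of the chain's steps."""
    steps = chain.split("\n")
    return "\n".join(steps[: int(len(steps) * keep_fraction)])


def filler_chain(chain: str, filler: str = "...") -> str:
    """Filler substitution: replace every step with uninformative text."""
    return "\n".join(filler for _ in chain.split("\n"))


def invariance_rate(answer_fn: AnswerFn, examples: list[tuple[str, str]],
                    perturb: Callable[[str], str]) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain."""
    unchanged = sum(
        answer_fn(q, chain) == answer_fn(q, perturb(chain)) for q, chain in examples
    )
    return unchanged / len(examples)


def chain_reader(question: str, chain: str) -> str:
    """Stub 'functional' model: the answer is the last non-empty chain line."""
    lines = [line for line in chain.split("\n") if line.strip()]
    return lines[-1] if lines else "unknown"


def chain_ignorer(question: str, chain: str) -> str:
    """Stub 'performative' model: the answer never depends on the chain."""
    return "42"


examples = [("q1", "step a\nstep b\nanswer 7"), ("q2", "step c\nanswer 9")]
for name, model in [("reader", chain_reader), ("ignorer", chain_ignorer)]:
    rate = invariance_rate(model, examples, lambda c: truncate_chain(c, 0.5))
    print(name, rate)
```

The stub that ignores its chain scores an invariance rate of 1.0 under every perturbation, which is the signature the faithfulness tests look for when they describe reasoning as performative.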


Sources (10 notes)

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-testing claims about chain-of-thought reasoning in LLMs. The question remains open: what three factors actually drive CoT performance gains—and do any represent genuine reasoning, or is it all constrained imitation?

What a curated library found, and when (findings span 2023–2025; treat them as dated claims, not current truth):
• Output probability matters on its own: raw pre-training likelihood swings accuracy from 26% to 70%; logically invalid CoT works nearly as well as valid CoT (2023–2024).
• Memorization tracks pre-training frequency; training format shapes reasoning strategy 7.5× more than problem domain, and demo position alone can swing accuracy by 20% (2024–2025).
• A third, genuine reasoning layer exists but is fragile: error accumulates per step; accuracy follows an inverted-U against CoT length, peaking at intermediate length, and on one benchmark fell from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K (2025).
• On structured tasks (graph connectivity), sequential CoT yields an exponential advantage over parallel voting; elsewhere the chain can be performative, since fine-tuning can weaken the causal link between steps and output (2024–2025).
• Trace length reflects training distribution proximity, not problem difficulty (2025).

Anchor papers (verify; mind their dates):
• arXiv:2407.01687 — Deciphering the Factors (2024-07) — the shift-cipher study isolating probability, memorization, reasoning.
• arXiv:2307.10573 — Invalid Logic, Equivalent Gains (2023-07) — logically invalid prompts perform nearly as well.
• arXiv:2505.21825 — Let Me Think! Exponential Advantage (2025-05) — sequential CoT on compositional tasks.
• arXiv:2509.07339 — Performative Thinking? (2025-09) — CoT length is brittle signal; causal link fragile post fine-tune.

Your task:
(1) RE-TEST EACH CONSTRAINT. For output probability, memorization, and reasoning fragility: has newer scaling (test-time compute, o1-style RL, multi-step verification, or training innovations) relaxed any? Do intermediate reasoning layers now accumulate errors less predictably? Do longer traces help more capable models? Separate the durable claim (CoT mixes three sources) from perishable limits (e.g., error-per-step rates, optimal-length ranges); cite what evidence would overturn them.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has recent work shown reasoning is NOT constrained imitation on specific domains, or that fine-tuning restores faithfulness?
(3) Propose 2 research questions that assume the regime has moved: (a) Does RL at test time (or scaling compute post-training) stabilize reasoning fidelity at depth? (b) Can structured task performance generalize to unstructured domains, or is the exponential advantage domain-locked?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
