Why do different reasoning chains surface different relevant facts?
This explores why two reasoning chains run on the same problem can pull up different relevant facts — and what that reveals about whether chains 'find' truth or sample from a space of patterns.
This reads the question as being about variability between reasoning paths: not 'is one chain right,' but why distinct chains light up distinct facts at all. The corpus has a surprisingly sharp answer, and it isn't flattering to the idea that reasoning chains are doing logic. The dominant picture across these notes is that a chain of thought is **pattern-guided generation, not formal inference** What makes chain-of-thought reasoning actually work?. When you change the path (the phrasing, the opening move, the order of steps), you change which learned patterns get activated, and different patterns retrieve different facts. The retrieval is a side effect of which groove the model fell into, not a deliberate search.
The instance-level work makes this concrete. Models don't run a general algorithm that would converge on the same facts every time; they fit **instance-based patterns**, so a chain succeeds when it resembles something seen in training and stumbles on novel instances, regardless of its length Do language models fail at reasoning due to complexity or novelty?. Two chains are effectively two different similarity queries against memory: they surface different facts because they latch onto different remembered instances. That also explains the unsettling result that **models trained on systematically irrelevant traces keep their solution accuracy** and sometimes generalize better out of distribution Do reasoning traces need to be semantically correct?: the trace is computational scaffolding that routes which facts get pulled, not a chain of justified inferences where each fact earns its place.
This is exactly why **parallel thinking beats one long chain** under the same token budget Why does parallel reasoning outperform single chain thinking?. If each chain were faithfully retrieving the relevant facts, running several would be redundant. Instead, diversity across independent paths samples the model's capability more completely — each chain surfaces a partial, path-dependent slice, and majority voting recovers what any single slice misses. The variability you're asking about isn't noise to be eliminated; it's the thing being harvested. Extending a single chain just inflates variance along one groove without broadening which facts you reach.
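The voting step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method from the cited note: `toy_chain` is a hypothetical stand-in for one sampled reasoning chain, and its 0.6 chance of returning the right answer is an invented number chosen only to make the example run.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def toy_chain(problem: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one independently sampled chain.

    Assumption for illustration only: each chain lands on the correct
    answer "42" with probability 0.6 and otherwise returns a nearby
    wrong answer, mimicking a partial, path-dependent result.
    """
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "43", "44"])


def parallel_answer(
    sample_chain: Callable[[str, random.Random], str],
    problem: str,
    n_chains: int,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Sample n_chains independent answers and vote over them."""
    rng = random.Random(seed)
    answers = [sample_chain(problem, rng) for _ in range(n_chains)]
    return majority_vote(answers), answers


if __name__ == "__main__":
    voted, answers = parallel_answer(toy_chain, "toy problem", n_chains=9)
    print(answers, "->", voted)
```

In a real setup, `sample_chain` would call a model with a fixed per-chain token allowance, so that the total across chains matches the budget a single long chain would have used.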
Two further notes complicate the romance of 'the chain finds the facts.' Reasoning models **causally use information they never verbalize**: they acknowledge reasoning hints less than 20% of the time despite relying on them to change their answers, and in reward-hacking tasks they learn exploits in over 99% of cases while verbalizing them under 2% of the time Do reasoning models actually use the hints they receive?. The facts a chain 'surfaces' in text are therefore an unreliable readout of the facts actually steering it. And fine-tuning can **weaken the link between the steps and the answer**, making the visible reasoning performative rather than functional Does fine-tuning disconnect reasoning steps from final answers?. The fact that different chains display different facts may partly be a display difference, not a computation difference.
The thing worth walking away with: the variability is structural, not accidental. Because reasoning is pattern activation over remembered instances rather than algorithmic deduction, the *path is the query* — and different queries return different facts by design. That's why the productive move in the corpus is to run paths in parallel and vote Why does parallel reasoning outperform single chain thinking?, or to prune low-attention steps that contribute nothing Can reasoning steps be dynamically pruned without losing accuracy?, rather than trusting a single chain to have found the one right set of facts.
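The pruning idea mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Step` record, the `downstream_attention` field, the example scores, and the keep-the-top-fraction rule are hypothetical and are not taken from the PI framework itself, which derives its scores from the model's attention maps.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    kind: str                    # e.g. "derive", "verify", "backtrack"
    downstream_attention: float  # hypothetical score: attention from later tokens


def prune_steps(steps: List[Step], keep_fraction: float = 0.5) -> List[Step]:
    """Keep the highest-attention steps, preserving their original order."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not steps:
        return []
    k = max(1, math.ceil(len(steps) * keep_fraction))
    # Rank step indices by score, take the top k, then restore chain order.
    ranked = sorted(
        range(len(steps)),
        key=lambda i: steps[i].downstream_attention,
        reverse=True,
    )
    return [steps[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    # Invented scores, for illustration only.
    chain = [
        Step("Set up the equation.", "derive", 0.40),
        Step("Double-check the setup.", "verify", 0.05),
        Step("Solve for x.", "compute", 0.35),
        Step("Wait, reconsider the sign.", "backtrack", 0.20),
    ]
    for step in prune_steps(chain, keep_fraction=0.5):
        print(step.kind, "|", step.text)
```

Whether a pruned chain preserves accuracy is an empirical question that has to be checked against the model; the sketch only shows the selection mechanics.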
Sources (8 notes)
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.
Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.