Why do wrong numbers cost less accuracy than shuffled reasoning steps?

This explores a puzzle the corpus circles repeatedly: corrupting the *content* of reasoning (wrong numbers, irrelevant steps) barely dents accuracy, while disrupting the *order and structure* of reasoning hurts a lot — suggesting reasoning traces work more like scaffolding than like literal argument.

Read this way, the question is why the *content* of a reasoning chain seems disposable while its *structure* does not, and the corpus has a surprisingly consistent answer spread across several notes that don't share vocabulary. The clearest single clue comes from Do reasoning traces need to be semantically correct?, which finds that models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization. The interpretation offered there is the key: traces function as *computational scaffolding* rather than as meaningful step-by-step deduction. If the model isn't really 'reading' the numbers as a human would, then swapping in wrong numbers doesn't break much: the trace's job is to allocate compute in a familiar shape, and that shape survives.

The opposite is true for structure, and that's where the neighboring notes become relevant. Order carries dependency. When reasoning goes wrong, it tends to go wrong *structurally*: Does failed-step fraction predict reasoning quality better? finds that the fraction of steps in abandoned branches predicts correctness better than CoT length or review ratio, because those failed branches stay in context and bias everything downstream. Shuffling steps is essentially manufacturing that same pathology on purpose: you put consequences before their premises, and the model conditions on the wrong things. Why do reasoning models abandon promising solution paths? makes the same point from the failure side: reasoning models fail through 'structural disorganization, not insufficient compute,' which is exactly what a shuffle imposes.
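The failed-step fraction itself is a simple ratio, and a minimal sketch may make it concrete. The `Step` class, its `abandoned` flag, and the example trace are hypothetical; how steps get labeled as belonging to an abandoned branch is not specified in the notes and is simply assumed here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    # Hypothetical label: True if the step sits in a branch the model later dropped.
    abandoned: bool


def failed_step_fraction(steps):
    """Fraction of steps that belong to abandoned branches (0.0 for an empty trace)."""
    if not steps:
        return 0.0
    return sum(s.abandoned for s in steps) / len(steps)


# Invented four-step trace, two of whose steps were later abandoned.
trace = [
    Step("Let x be the unknown quantity.", False),
    Step("Try factoring the quadratic.", True),
    Step("That does not factor cleanly; go back.", True),
    Step("Use the quadratic formula instead.", False),
]

print(failed_step_fraction(trace))  # 0.5
```

The abandoned steps still sit in the trace, which is the point of the metric: they remain part of what every later step is conditioned on.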

There's a deeper reason order is load-bearing while values aren't. Reasoning is autoregressive: each step is generated conditioned on the ones before it. Wrong numbers leave the conditioning chain intact (step 5 still follows from the shape of steps 1–4); shuffled steps destroy it (step 5 now follows from nonsense). This is why the decoding penalty described in Do reasoning models switch between ideas too frequently? can recover accuracy purely by penalizing *when* the model switches thoughts, with no retraining and no content change, and why the method in Can intermediate reasoning points yield better answers than final ones? gets more accurate answers by sampling from intermediate points *before* premature commitment narrows the path. In both cases the lever is sequence and timing, not the truth of any individual step.
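The asymmetry between corrupting values and shuffling order can be illustrated with a toy dependency count. This is only a sketch: the five-step trace and its `DEPS` table are invented, and counting out-of-order premises is a stand-in for "what the model conditions on", not a measurement of any real model.

```python
# Hypothetical five-step trace: each step lists the earlier steps it depends on.
DEPS = {1: [], 2: [1], 3: [2], 4: [2, 3], 5: [4]}


def broken_dependencies(order, deps=DEPS):
    """Count (step, premise) pairs where the premise appears after the step using it."""
    position = {step: i for i, step in enumerate(order)}
    return sum(
        1
        for step, premises in deps.items()
        for premise in premises
        if position[premise] > position[step]
    )


original = [1, 2, 3, 4, 5]

# Corrupting numbers changes what a step says, not where it sits,
# so the order (and every dependency) is untouched.
corrupted_values = list(original)

# Shuffling moves steps around; full reversal breaks every dependency.
partly_shuffled = [3, 1, 5, 2, 4]
reversed_order = original[::-1]

print(broken_dependencies(corrupted_values))  # 0
print(broken_dependencies(partly_shuffled))   # 2
print(broken_dependencies(reversed_order))    # 5
```

Under this toy model, no amount of value corruption can raise the count above zero, while even a mild shuffle puts consequences ahead of their premises.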

The attention evidence closes the loop. Can reasoning steps be dynamically pruned without losing accuracy? shows that whole categories of steps (verification, backtracking) receive almost no downstream attention — you can delete 75% of steps and keep accuracy. That tells you most of the *content* is low-weight: the model isn't leaning on it. What it does lean on is the ordered backbone that those low-attention steps hang from. Corrupt a node the model barely attends to and little happens; reorder the backbone and you've changed what every later token is conditioned on.
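Attention-guided pruning of this kind can be sketched as a top-k selection that keeps the survivors in their original order. The function name `prune_by_attention`, the `keep_fraction` parameter, and the step labels and attention scores are all hypothetical; the scores are made-up numbers, not measured values, and the actual selection rule used in the note's framework may differ.

```python
def prune_by_attention(steps, attention, keep_fraction=0.25):
    """Keep the top `keep_fraction` of steps by downstream attention, in original order."""
    if len(steps) != len(attention):
        raise ValueError("steps and attention must have the same length")
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices by attention, highest first, and take the top n_keep.
    top = sorted(range(len(steps)), key=lambda i: attention[i], reverse=True)[:n_keep]
    # Re-sort survivors by position so the ordered backbone is preserved.
    return [steps[i] for i in sorted(top)]


# Invented trace with invented attention scores (verification and backtracking score low).
steps = [
    "set up equation", "verify setup", "solve for x", "backtrack",
    "re-verify", "substitute x", "double-check", "state answer",
]
attention = [0.10, 0.02, 0.40, 0.01, 0.02, 0.06, 0.04, 0.35]

print(prune_by_attention(steps, attention))  # ['solve for x', 'state answer']
```

The final re-sort is the design choice that matters here: pruning removes nodes but never reorders what remains.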

The thing you might not have expected to learn: this implies chain-of-thought is closer to a *procedure* than to an *explanation*. The fragility lives in the sequencing logic, not the facts — which is why interventions that respect order (step-level confidence filtering, transition penalties, intermediate sampling) keep showing up in this corpus as cheap wins, while the field is slowly conceding that the literal correctness of intermediate steps was never doing as much work as it looked like.
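As one example of an order-respecting intervention, a transition penalty can be sketched as a logit adjustment applied at decoding time. Everything concrete here is an assumption made for illustration: the `SWITCH_TOKENS` set, the `penalty` and `window` parameters, and the toy logits are hypothetical and do not reproduce the exact formulation of the TIP strategy.

```python
# Hypothetical markers of a thought switch; a real tokenizer would use token ids.
SWITCH_TOKENS = {"Alternatively", "Wait"}


def penalize_switches(logits, steps_in_thought, penalty=3.0, window=50):
    """Return a copy of `logits` with switch tokens penalized early in a thought."""
    adjusted = dict(logits)
    # Only discourage switching while the current thought is still young.
    if steps_in_thought < window:
        for token in SWITCH_TOKENS:
            if token in adjusted:
                adjusted[token] -= penalty
    return adjusted


def greedy(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy next-token scores in which a switch token narrowly wins.
logits = {"Alternatively": 2.1, "so": 2.0, "therefore": 1.5}

print(greedy(logits))                                          # Alternatively
print(greedy(penalize_switches(logits, steps_in_thought=10)))  # so
print(greedy(penalize_switches(logits, steps_in_thought=80)))  # Alternatively
```

Nothing about the content of any step changes in this sketch; only the timing of a switch is nudged, which matches the claim that the lever is sequencing.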

Sources (6 notes)

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does failed-step fraction predict reasoning quality better?

Across 10 reasoning models, the fraction of steps in abandoned branches consistently predicts correctness better than CoT length or review ratio. Failed branches persist in context and bias subsequent reasoning, a phenomenon confirmed through correlation, reranking, and direct causal editing.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can intermediate reasoning points yield better answers than final ones?

Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
