INQUIRING LINE

Why do corrupted traces maintain performance as well as correct traces?

This explores a counterintuitive finding — that models trained on deliberately wrong or irrelevant reasoning steps solve problems about as well as models trained on correct ones — and asks what that says about whether the 'reasoning' in a reasoning trace is doing the work we assume it is.


The corpus has a blunt answer to why garbled reasoning teaches as well as clean reasoning: the trace mostly isn't where the reasoning lives. The anchor finding is that models trained on systematically corrupted or irrelevant traces keep their accuracy, and sometimes generalize *better* out of distribution (Do reasoning traces need to be semantically correct?). The proposed explanation is that traces work as computational *scaffolding*, extra tokens that give the model room to compute, rather than as a meaningful chain of inferences. If that's true, corrupting the content barely matters, because the content was never the load-bearing part.
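One way to picture the "irrelevant trace" condition is the sketch below. It is a minimal illustration, not the procedure used in the cited work: the `Example` type, the `with_irrelevant_traces` helper, and the rotate-by-offset scheme are all assumptions introduced here. The only property it is meant to show is that each problem keeps its correct final answer while its trace is swapped for one that belongs to a different problem.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    problem: str
    trace: str   # intermediate "reasoning" tokens
    answer: str  # final answer, left correct in both conditions


def with_irrelevant_traces(examples, seed=0):
    """Return a copy in which every problem carries another problem's trace.

    Hypothetical corruption scheme: rotate the traces by a random non-zero
    offset, so no example keeps its own trace. Problems and answers are
    left untouched.
    """
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples to swap traces")
    offset = random.Random(seed).randrange(1, n)  # 1 <= offset <= n - 1
    return [
        Example(ex.problem, examples[(i + offset) % n].trace, ex.answer)
        for i, ex in enumerate(examples)
    ]
```

Training one model on the original examples and another on the output of `with_irrelevant_traces`, then comparing final-answer accuracy, would be the shape of the comparison the finding describes.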

A companion line sharpens this into a causal claim: the intermediate tokens in models like R1 are generated identically to any other LLM output and carry no special execution semantics, and invalid traces routinely produce correct answers (Do reasoning traces actually cause correct answers?). So the trace correlates with the answer through learned formatting (it *looks* like a derivation), but it isn't causally necessary. That reframes your question: corrupted traces maintain performance because correct traces were never *causing* performance in the first place; both are stylistic mimicry wrapped around a computation happening elsewhere.

The evaluation papers explain why this illusion held for so long. If you score the reasoning steps instead of the final answer, you inflate capability by rewarding traces that merely *resemble* reasoning. That is exactly why some benchmarks now score only the final solution against deterministic ground truth, exposing ceilings that trace-based grading papered over (Should reasoning benchmarks score final answers or reasoning traces?). The same skepticism shows up in work finding that reflection is mostly confirmatory theater: re-reading rarely changes the initial answer, and traces don't faithfully represent the underlying reasoning (Can we actually trust reasoning model outputs?). Even RLVR, which measurably tidies traces, only improves *coherence* between adjacent steps without guaranteeing the proof is valid, a structural improvement rather than a semantic one (Does RLVR actually improve mathematical reasoning or just coherence?).
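The difference between the two grading styles can be sketched in a few lines. This is an illustration under stated assumptions, not any benchmark's actual scorer: `normalize`, `final_answer_accuracy`, `REASONING_MARKERS`, and `style_score` are hypothetical names, and `style_score` is a deliberately naive stand-in for a grader that rewards text for looking like reasoning.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not penalized."""
    return " ".join(answer.lower().split())


def final_answer_accuracy(predictions, ground_truth):
    """Fraction of final answers that match the reference after normalizing."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


# Deliberately naive stand-in for a grader that rewards reasoning-like style.
REASONING_MARKERS = ("therefore", "so", "because", "step")


def style_score(trace: str) -> float:
    """Share of marker words present in the trace; ignores correctness entirely."""
    words = set(trace.lower().split())
    return sum(marker in words for marker in REASONING_MARKERS) / len(REASONING_MARKERS)
```

A trace such as `"Step 1 because x so therefore 5"` gets the maximum `style_score` even when the reference answer is `"4"`, while `final_answer_accuracy` counts it as a miss. That gap is the inflation the paragraph above describes.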

Here's the twist worth carrying away: if traces were pure scaffolding, more of them should be harmless, but they're not. Longer traces correlate with *more* errors, because extra self-revisions introduce and compound mistakes, which is why correct solutions tend to be the shorter ones (Why do correct reasoning traces contain fewer tokens?). And once a model's own wrong steps fill its context, performance degrades non-linearly as it conditions on its own errors (Do models fail worse when their own errors fill the context?). So the content of a trace is simultaneously *not necessary* for getting the right answer and *capable of doing harm* when it goes long or self-poisons, which is a strange, important combination.
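The length comparison behind that claim is simple to reproduce on any set of graded traces. The sketch below is an assumption-laden illustration: `mean_length_by_correctness` is a hypothetical helper, and it counts whitespace-separated words as a rough proxy for tokens rather than using a model's real tokenizer.

```python
from statistics import mean


def mean_length_by_correctness(records):
    """Average trace length for correct and incorrect solutions.

    `records` is an iterable of (trace_text, is_correct) pairs. Length is
    measured in whitespace-separated words, an assumed proxy for tokens.
    A group with no records is reported as None.
    """
    lengths = {True: [], False: []}
    for trace, is_correct in records:
        lengths[bool(is_correct)].append(len(trace.split()))
    return {
        "correct": mean(lengths[True]) if lengths[True] else None,
        "incorrect": mean(lengths[False]) if lengths[False] else None,
    }
```

If the pattern described above holds for a given model, the `"correct"` average should come out lower than the `"incorrect"` one. The comparison shows correlation only and says nothing about why the longer traces fail.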

That tension points at where the field is heading. If final-answer correctness is robust to corrupted traces but real reliability still lives in the intermediate states, then you verify the *process* during generation rather than trusting the trace as a post-hoc story. One study lifted task success from 32% to 87% by checking intermediate states and policy compliance instead of scoring outputs (Where do reasoning agents actually fail during long traces?). The lesson isn't that traces are worthless; it's that their value is computational and procedural, not the human-readable narrative we instinctively trust.
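Checking the process during generation, rather than grading the output afterwards, might look roughly like the sketch below. It is a toy under stated assumptions and not the cited study's system: `run_with_verification`, the dictionary-based state, and the `withdraw` and `non_negative_balance` helpers are all hypothetical names chosen for illustration.

```python
def run_with_verification(steps, checks, state=None):
    """Apply steps in order, validating the state after each one.

    `steps` are functions state -> new state. `checks` are functions
    state -> violation message, or None if the state is acceptable.
    Execution halts at the first violation instead of running to the end.
    """
    state = dict(state or {})
    for index, step in enumerate(steps):
        state = step(state)
        for check in checks:
            reason = check(state)
            if reason is not None:
                return {"ok": False, "failed_step": index, "reason": reason, "state": state}
    return {"ok": True, "failed_step": None, "reason": None, "state": state}


def withdraw(amount):
    """Hypothetical agent action: subtract `amount` from the balance."""
    def step(state):
        return {**state, "balance": state["balance"] - amount}
    return step


def non_negative_balance(state):
    """Hypothetical policy check on the intermediate state."""
    return "balance below zero" if state["balance"] < 0 else None


result = run_with_verification(
    [withdraw(30), withdraw(80), withdraw(10)],
    [non_negative_balance],
    {"balance": 100},
)
```

Here the second step breaks the policy, so the run stops at index 1 and the third step never executes. A grader that only inspected the final output would not see where the violation happened.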


Sources (8 notes)

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that a valid trace is not causally necessary: traces correlate with answers via learned formatting, not functional reasoning.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Why do correct reasoning traces contain fewer tokens?

Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
