SYNTHESIS NOTE

Does every correct chain-of-thought trace improve fine-tuning?

Are all answer-correct reasoning traces equally valuable for training? This note explores whether some correct traces contain reasoning that actually harms model learning despite reaching the right answer.

Synthesis note · 2026-06-03 · sourced from Reasoning Critiques

The standard assumption behind distilling long chain-of-thought (CoT) traces into a smaller model via supervised fine-tuning (SFT) is that a trace is useful supervision once its final answer is correct. "Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces" (2605.29288) breaks that assumption. It identifies post-conclusion continuation: a segment where the answer is already sufficiently supported but the trace keeps reasoning. That tail preserves the correct answer, yet it is harmful to train on. A delete-only editor that excises the post-conclusion suffix while keeping the answer produces measurably better SFT than training on the full trace. The authors name the empirically confirmed phenomenon harmful continuation and ship a lightweight boundary proxy, Harmful Continuation Cut (HCC), that approximates where useful reasoning ends.
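A minimal sketch of what a delete-only edit could look like, assuming a trace is stored as a list of reasoning steps plus a separate final answer. The `Trace` type, the `delete_only_cut` function, and the `boundary` argument are hypothetical names introduced here for illustration; the note does not describe the paper's data format or how HCC computes the boundary, so the boundary is simply taken as an input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical container for one answer-correct training trace."""
    steps: List[str]  # reasoning steps, in generation order
    answer: str       # final answer text, kept verbatim


def delete_only_cut(trace: Trace, boundary: int) -> Trace:
    """Drop every reasoning step from `boundary` onward and keep the answer.

    Delete-only means nothing is rewritten or added: the surviving steps
    and the answer are copied unchanged from the original trace.
    `boundary` is assumed to come from some external boundary detector.
    """
    if not 0 <= boundary <= len(trace.steps):
        raise ValueError("boundary must lie within the trace")
    return Trace(steps=list(trace.steps[:boundary]), answer=trace.answer)


# Usage: the last two steps re-check an answer that is already supported.
full = Trace(
    steps=["set up equation", "solve for x", "double-check", "try another way"],
    answer="x = 4",
)
trimmed = delete_only_cut(full, boundary=2)
```

Because the edit only deletes, the trimmed trace is always a prefix of the original followed by the same answer, which keeps the comparison against full-trace SFT clean: any difference comes from the removed suffix, not from new text.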

The diagnostic move is what makes this distinct. The harmful tail is characterized by an uncertainty–geometry mismatch: persistent local uncertainty (the model keeps exploring as if unsettled) combined with weakened terminal-directional hidden-state progress (the exploration no longer moves the representation toward the answer). That mismatch, not length itself, is the signature. A random-cut baseline that removes a length-matched suffix without identifying where reasoning concluded performs far worse (average 29.0 vs. 49.3 for HCC across MATH500/AMC23/GSM8K), which indicates that the gain comes from cutting the right segment, not from shorter outputs.
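The mismatch can be made concrete with a toy scoring sketch. This is not the HCC formula, which the note does not give. It assumes uncertainty is measured as per-step Shannon entropy, that directional progress is the cosine between a step's hidden-state displacement and the direction toward the terminal hidden state, and that a boundary is the start of the longest trailing run where entropy stays high while progress stays low. All function names and thresholds are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def directional_progress(
    prev: Sequence[float], curr: Sequence[float], terminal: Sequence[float]
) -> float:
    """Cosine between the displacement (curr - prev) and the direction
    from prev to the terminal hidden state. Returns 0.0 if either
    vector has zero length."""
    step = [c - p for p, c in zip(prev, curr)]
    goal = [t - p for p, t in zip(prev, terminal)]
    norm = math.hypot(*step) * math.hypot(*goal)
    if norm == 0.0:
        return 0.0
    return sum(s * g for s, g in zip(step, goal)) / norm


def mismatch_boundary(
    entropies: Sequence[float],
    progress: Sequence[float],
    min_entropy: float,
    max_progress: float,
) -> int:
    """Index where the trailing mismatch run begins.

    Scans backward from the last step and extends the run while a step
    is both uncertain (entropy >= min_entropy) and not moving toward the
    terminal state (progress <= max_progress). Returns len(entropies)
    when the trace has no such tail.
    """
    boundary = len(entropies)
    for i in range(len(entropies) - 1, -1, -1):
        if entropies[i] >= min_entropy and progress[i] <= max_progress:
            boundary = i
        else:
            break
    return boundary


# Usage with made-up per-step values: the last three steps stay uncertain
# while making little progress toward the terminal state.
entropies = [0.2, 0.3, 1.1, 1.2, 1.0]
progress = [0.9, 0.8, 0.1, 0.05, -0.1]
cut_at = mismatch_boundary(entropies, progress, min_entropy=0.8, max_progress=0.2)
```

The point of requiring both conditions is the one the note makes: a long but still-converging tail (high progress) or a confident wrap-up (low entropy) would not be flagged, so the criterion targets a specific segment instead of penalizing length.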

This sits beside but does not duplicate the vault's existing trace-quality findings. It is not the faithfulness decay of "Does fine-tuning disconnect reasoning steps from final answers?", nor the benchmark-vs-quality divergence of "Does supervised fine-tuning improve reasoning or just answers?". Both of those describe what fine-tuning does to a model, whereas harmful continuation is a property of the training data itself. It sharpens the correlation in "Why do correct reasoning traces contain fewer tokens?": shorter-is-better holds, but the causal lever is removing post-conclusion exploration, not length per se. And it gives a data-curation counterpart to "Can reasoning steps be dynamically pruned without losing accuracy?": redundancy that is steerable at inference is also deletable at training time.

Original note title

answer-correct chain-of-thought traces can still harm SFT — reasoning that continues after the answer is supported is low-value supervision and deleting it improves fine-tuning