Does self-distillation harm mathematical reasoning performance?

Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance can drop by up to 40%. What mechanism explains this counter-intuitive degradation?

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

Self-distillation has emerged as an effective post-training paradigm: it usually improves performance while shortening reasoning traces, which is a clean win. The paper "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?" documents a counter-finding: in mathematical reasoning, self-distillation can reduce response length while degrading performance, with drops of up to 40% on Qwen3, DeepSeek-Distill-Qwen, and Olmo3.

The mechanism is the suppression of epistemic verbalization. Strong reasoning models like DeepSeek-R1 frequently express uncertainty mid-trace using tokens like "Wait" or "Hmm." These tokens look like noise: they do not directly advance the argument, and they add length without obvious content. The standard intuition is that distilling toward shorter, more confident traces should be an improvement: same answers, less verbosity, lower inference cost.

The empirical finding contradicts this. Removing the uncertainty tokens removes the signal that a reasoning path may be flawed. When the student model is distilled away from epistemic verbalization, it loses the ability to flag and self-correct its own faulty reasoning paths. The shorter, more confident traces correlate with worse performance on out-of-distribution problems, where the model would have benefited from pausing to verbalize doubt.

This reframes "Wait" and "Hmm" tokens. They are not stylistic noise to be optimized away; they are markers of a corrective mechanism, the surface signature of the model noticing that something is off and adjusting course. Compressing the trace by removing them removes an internal control structure.
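
To make the idea of a surface signature concrete, here is a minimal sketch of how one might count such markers in a reasoning trace. It is an illustration, not the paper's measurement procedure: the function name `epistemic_marker_stats` and the marker list are assumptions, and only "Wait" and "Hmm" are taken from the note itself.

```python
import re

# Assumed marker list: "wait" and "hmm" are the two examples named in the note.
# A real analysis would likely need a broader, validated list.
EPISTEMIC_MARKERS = ("wait", "hmm")

# Whole-word, case-insensitive match, so "waiting" is not counted as "wait".
_MARKER_RE = re.compile(r"\b(" + "|".join(EPISTEMIC_MARKERS) + r")\b", re.IGNORECASE)


def epistemic_marker_stats(trace: str) -> dict:
    """Count epistemic markers in a trace and normalise by trace length.

    Words are split on whitespace, which is a crude stand-in for a tokenizer.
    """
    n_words = len(trace.split())
    n_markers = len(_MARKER_RE.findall(trace))
    per_100 = 100.0 * n_markers / n_words if n_words else 0.0
    return {"words": n_words, "markers": n_markers, "per_100_words": per_100}


if __name__ == "__main__":
    verbose = "Let x = 3. Wait, that gives 9, not 8. Hmm, try x = 2."
    terse = "Let x = 2. Then the result is 8."
    print(epistemic_marker_stats(verbose))
    print(epistemic_marker_stats(terse))
```

Comparing this rate between a base model's traces and a distilled model's traces is one simple way to see whether the markers have been compressed away, though a raw count says nothing about whether each marker actually led to a correction.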

The implication for self-distillation design is sharp. Distillation that uses richly conditioned teachers produces confident, concise students. Such students do well on in-distribution problems, where confidence is warranted. They fail on out-of-distribution problems, where uncertainty would have been the right response. The distillation regime needs to preserve the uncertainty channel, not just optimize for shorter correct outputs.
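
One way to act on this is to monitor the uncertainty channel before training on distillation targets. The sketch below is a hypothetical guard, not a method from the paper: it compares the mean rate of "Wait"/"Hmm" markers in the candidate distillation targets against the base model's own traces and flags the dataset if too little of that rate survives. The function names and the `min_retention=0.5` threshold are assumptions chosen for illustration.

```python
import re
from statistics import mean

# Assumed markers: the two examples named in the note.
MARKER_RE = re.compile(r"\b(wait|hmm)\b", re.IGNORECASE)


def marker_rate(trace: str) -> float:
    """Markers per whitespace-separated word; 0.0 for an empty trace."""
    n_words = len(trace.split())
    return len(MARKER_RE.findall(trace)) / n_words if n_words else 0.0


def marker_retention(base_traces, target_traces) -> float:
    """Ratio of the mean marker rate in the targets to that in the base traces.

    1.0 means the rate is fully preserved, 0.0 means the markers are gone.
    Both lists are assumed to be non-empty.
    """
    base = mean(marker_rate(t) for t in base_traces)
    target = mean(marker_rate(t) for t in target_traces)
    # If the base model never verbalises doubt, there is nothing to lose.
    return target / base if base > 0 else 1.0


def uncertainty_channel_preserved(base_traces, target_traces,
                                  min_retention: float = 0.5) -> bool:
    """Hypothetical check; the 0.5 threshold is arbitrary, not from the paper."""
    return marker_retention(base_traces, target_traces) >= min_retention


if __name__ == "__main__":
    base = ["Wait, is 7 prime? Hmm, yes."]
    distilled = ["7 is prime."]
    print(marker_retention(base, distilled))
    print(uncertainty_channel_preserved(base, distilled))
```

A check like this only detects the loss. Preserving the channel would still require a change to the training data or objective, which the note leaves open.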

Related concepts in this collection 4

Does richer teacher context hurt student generalization? When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?
same paper, the mechanism that produces the degradation
Can post-training objectives preserve reasoning style alongside correctness? Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?
same paper, the broader methodology implication
Do reflection tokens carry more information about correct answers? Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
directly supports: empirical evidence that Wait/Hmm/Therefore tokens carry disproportionate information; this paper shows what happens when they are suppressed
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
adjacent: the broader CoT critique frame
