SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals

Does self-distillation harm mathematical reasoning performance?

Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops by up to 40%. What mechanism explains this counter-intuitive degradation?

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

Self-distillation has emerged as an effective post-training paradigm: it usually improves performance while shortening reasoning traces, which is a clean win. The paper "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?" documents a counter-finding. In mathematical reasoning, self-distillation can reduce response length while degrading performance, with drops of up to 40% on Qwen3, DeepSeek-Distill-Qwen, and Olmo3.

The mechanism is suppression of epistemic verbalization. Strong reasoning models like DeepSeek-R1 frequently express uncertainty mid-trace using tokens like "Wait" or "Hmm." These tokens look like noise: they do not directly advance the argument, and they add length without obvious content. The standard intuition is that distilling toward shorter, more confident traces should be an improvement, giving the same answers with less verbosity and lower inference cost.
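To make "epistemic verbalization" concrete, here is a minimal sketch that counts uncertainty markers in a reasoning trace. Only "Wait" and "Hmm" come from the note; the function name `epistemic_marker_stats`, the word-based normalization, and the regex matching are illustrative assumptions, not the paper's measurement method.

```python
import re

# Only "wait" and "hmm" are named in the note; extend this tuple at your own risk.
EPISTEMIC_MARKERS = ("wait", "hmm")

# Match markers as whole words, case-insensitively.
_MARKER_RE = re.compile(r"\b(" + "|".join(EPISTEMIC_MARKERS) + r")\b", re.IGNORECASE)
# Crude word tokenizer (an assumption; a real analysis would use model tokens).
_WORD_RE = re.compile(r"\w+")


def epistemic_marker_stats(trace: str) -> dict:
    """Hypothetical helper: count uncertainty markers in one reasoning trace."""
    n_words = len(_WORD_RE.findall(trace))
    n_markers = len(_MARKER_RE.findall(trace))
    return {
        "n_words": n_words,
        "n_markers": n_markers,
        # Normalize by length so long and short traces are comparable.
        "markers_per_100_words": 100.0 * n_markers / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    trace = "Let x = 3. Wait, that gives 9, not 8. Hmm, try x = 2."
    print(epistemic_marker_stats(trace))
```

Normalizing by trace length matters here because self-distillation shortens outputs: a raw count would fall even if the model hedged just as often per step.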

The empirical finding contradicts this. Removing the uncertainty tokens removes the signal that a reasoning path may be flawed. When the student model is distilled away from epistemic verbalization, it loses the ability to flag and self-correct its own faulty reasoning paths. The shorter, more confident traces are correlated with worse performance on out-of-distribution problems where the model would have benefited from pausing to verbalize doubt.

This reframes "Wait" and "Hmm" tokens. They are not stylistic noise to be optimized away; they are markers of a corrective mechanism, the surface signature of the model noticing that something is off and adjusting course. Compressing the trace by removing them removes an internal control structure.

The implication for self-distillation design is sharp. Distillation from richly conditioned teachers produces confident, concise students. Such students do well on in-distribution problems, where confidence is warranted, and fail on out-of-distribution problems, where expressing uncertainty would have been the right response. The distillation regime needs to preserve the uncertainty channel, not just optimize for shorter correct outputs.
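One way to watch the uncertainty channel during distillation is to compare traces sampled before and after a distillation round. The sketch below is a hypothetical diagnostic, not a procedure from the paper: the function `suppression_report`, the `min_rate_ratio` threshold of 0.5, and the word-level marker rate are all assumptions chosen for illustration.

```python
import re
from statistics import mean

# Markers named in the note; the matching rule is an assumption.
_MARKER_RE = re.compile(r"\b(wait|hmm)\b", re.IGNORECASE)
_WORD_RE = re.compile(r"\w+")


def _n_words(trace: str) -> int:
    return len(_WORD_RE.findall(trace))


def _marker_rate(trace: str) -> float:
    """Markers per word for one trace (0.0 for an empty trace)."""
    n = _n_words(trace)
    return len(_MARKER_RE.findall(trace)) / n if n else 0.0


def suppression_report(before, after, min_rate_ratio=0.5):
    """Hypothetical check: did traces get shorter AND lose uncertainty markers?

    before / after: non-empty lists of trace strings sampled on the same
    prompts before and after distillation. min_rate_ratio is an arbitrary
    illustrative threshold, not a value from the paper.
    """
    if not before or not after:
        raise ValueError("both trace lists must be non-empty")
    len_before = mean(_n_words(t) for t in before)
    len_after = mean(_n_words(t) for t in after)
    rate_before = mean(_marker_rate(t) for t in before)
    rate_after = mean(_marker_rate(t) for t in after)
    # If there was nothing to suppress, report a neutral ratio of 1.0.
    length_ratio = len_after / len_before if len_before else 1.0
    rate_ratio = rate_after / rate_before if rate_before else 1.0
    return {
        "length_ratio": length_ratio,
        "marker_rate_ratio": rate_ratio,
        # Flag only the pattern the note describes: shorter and more confident.
        "flagged": length_ratio < 1.0 and rate_ratio < min_rate_ratio,
    }


if __name__ == "__main__":
    before = ["Let x = 3. Wait, that gives 9. Hmm, try x = 2. So x = 2."]
    after = ["x = 2."]
    print(suppression_report(before, after))
```

A flag from this check is only a prompt to look closer. It says the student became shorter and less hedged, which the note associates with out-of-distribution failures, but it does not measure accuracy itself.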

Original note title

self-distillation can degrade reasoning by suppressing epistemic verbalization — Wait and Hmm tokens carry uncertainty signal not noise