What makes policy self-distillation more effective than external teacher distillation?
This explores why a model often learns better from data it generates itself (policy self-distillation) than from traces handed down by a stronger external teacher — and where that advantage breaks down.
The corpus has a surprisingly consistent answer: the bottleneck isn't trace quality, it's compatibility with the student's own representations. SEAL is the clearest case: models learned better from synthetic data they restructured themselves than from data written by stronger external models, lifting QA accuracy from 33.5% to 47.0%, which suggests that self-generated data aligns with the learner's representational needs (Does self-generated training data improve model learning?). The same logic explains why teacher-refined data can backfire: when refinements exceed the student's learning frontier, objectively higher-quality data degrades performance, and students do better filtering for what's compatible with their own statistical profile (Does teacher-refined data always improve student model performance?).
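The filtering idea can be made concrete with a minimal sketch. The notes do not say how a student's "statistical profile" is measured, so this assumes one plausible proxy: the mean per-token log-probability the student assigns to each text. The function names, the `max_drop` threshold, and the dictionary keys are all hypothetical, and the log-probabilities are assumed to have been computed elsewhere by scoring each text with the student model.

```python
from statistics import mean


def mean_logprob(token_logprobs):
    """Average per-token log-probability of a text under the student."""
    return mean(token_logprobs)


def filter_refinements(examples, max_drop=0.5):
    """Keep a teacher refinement only if the student finds it about as
    likely as the original text; otherwise fall back to the original.

    Each example is a dict with (hypothetical) keys:
      'original', 'refined'       -> the two candidate training texts
      'lp_original', 'lp_refined' -> student token log-probs for each
    `max_drop` is an assumed tolerance on the loss in mean log-prob.
    """
    kept = []
    for ex in examples:
        drop = mean_logprob(ex["lp_original"]) - mean_logprob(ex["lp_refined"])
        # A large drop means the refinement sits far from what the
        # student currently models well, so it is treated as incompatible.
        kept.append(ex["refined"] if drop <= max_drop else ex["original"])
    return kept


examples = [
    {"original": "a", "refined": "a+",
     "lp_original": [-1.0, -1.0], "lp_refined": [-1.2, -1.2]},
    {"original": "b", "refined": "b+",
     "lp_original": [-1.0, -1.0], "lp_refined": [-3.0, -3.0]},
]
selected = filter_refinements(examples)
```

Here the first refinement is kept because the student finds it almost as likely as the original, while the second is rejected in favor of the original. Other compatibility measures could be swapped in for `mean_logprob` without changing the structure.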
The deeper mechanism is a distribution-mismatch problem. Training on offline teacher traces fails at self-correction because the teacher's errors aren't the student's errors; models only learn to fix mistakes by practicing on their actual error distribution under online RL (Why does self-correction training on offline data fail?). So self-distillation wins partly because the policy is, by definition, training on its own distribution.
But there's a sharp catch the corpus insists on: self-distillation isn't free. Distilling from a confident teacher, or from yourself in a way that mimics one, suppresses the epistemic markers ('Wait,' 'Hmm') that flag flawed reasoning paths, trading out-of-distribution robustness for confident, concise in-domain answers (Does self-distillation harm mathematical reasoning performance?). Richer teacher context makes this worse, not better: conditioning a teacher on correct answers and verifier output yields shorter, more confident student traces that generalize worse on problems requiring caution (Does richer teacher context hurt student generalization?). The advantage of self-distillation, then, isn't that it's safe. It's that it keeps the student inside its own distribution rather than forcing a style it can't support.
The surprising twist is the opposite case, where the external teacher wins. Walmart's BERT cross-encoders outperformed the very LLM teachers that labeled their data, because the student saw a far broader input distribution smoothed by teacher predictions (Can smaller models outperform their LLM teachers with enough data?). So the real variable isn't 'self vs. external'; it's distributional coverage and representational fit. When the teacher's labels broaden the student's exposure, external wins; when the teacher's style narrows or mismatches it, self-distillation wins.
Worth knowing where this bottoms out: pure self-improvement is formally bounded by the generation-verification gap. A model can't reliably exceed what it can verify, and every durable gain smuggles in an external anchor: a past checkpoint, a judge, a tool, a user correction (Can models reliably improve themselves without external feedback?; What stops large language models from improving themselves?). The methods that push self-distillation furthest sneak verification back in: asymmetric self-play uses a proposer-solver split with majority-vote checking (Can language models improve themselves without any external training data?), and self-examining RL alternates actor and judge roles to manufacture an internal reward signal (Can models learn to judge themselves without external rewards?). Self-distillation works best not when it's purely self-contained, but when the model is its own best-matched teacher and something still plays the role of verifier.
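To illustrate how majority-vote checking can stand in for a verifier, here is a minimal sketch. It is not the SQLM implementation: the function name, the `min_agreement` threshold, and the 0/1 reward scheme are assumptions for illustration. The idea is that several sampled answers to one self-proposed problem are compared, the most common answer is treated as a pseudo-label, and each sample is rewarded for agreeing with it.

```python
from collections import Counter


def majority_vote_rewards(answers, min_agreement=0.5):
    """Return (pseudo_label, rewards) for a list of sampled answers.

    The most frequent answer becomes the pseudo-label. If too few
    samples agree with it (below the assumed `min_agreement` fraction),
    the problem is treated as unverifiable and no sample is rewarded.
    """
    if not answers:
        return None, []
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return None, [0.0] * len(answers)
    # Reward 1.0 for matching the majority answer, 0.0 otherwise.
    return top_answer, [1.0 if a == top_answer else 0.0 for a in answers]


label, rewards = majority_vote_rewards(["4", "4", "5", "4"])
no_label, no_rewards = majority_vote_rewards(["1", "2", "3"])
```

In the first call three of four samples agree, so "4" becomes the pseudo-label and the dissenting sample gets no reward. In the second call no answer reaches the agreement threshold, so nothing is rewarded. This also shows the limit the paragraph describes: the vote only measures agreement, so a confidently shared mistake would still be rewarded.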
Sources (10 notes)
SEAL demonstrates that models learn better from synthetic data they generate themselves than from data created by stronger external models. Self-generated data improved QA performance from 33.5% to 47.0%, suggesting that model-specific restructuring aligns with the learner's representational needs.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Self-distillation reduces performance in mathematical reasoning by eliminating epistemic marker tokens such as "Wait" and "Hmm" that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.