Can self-training drift be prevented by applying student compatibility filtering?
This explores whether the 'drift' that creeps in when a model trains on its own (or a teacher's) outputs can be headed off by keeping only the material that fits the student model's current ability — and the corpus suggests compatibility filtering is one real lever, but it's a narrower fix than the failure it's aimed at.
This reads the question as: does screening training data for student-model compatibility actually stop the slow degradation that happens when models learn from generated outputs? The most direct support is the finding that teacher-refined data, even objectively higher-quality data, *hurts* a student when it lands beyond the student's learning frontier, and that letting the student filter refinements against its own statistical profile keeps only the improvements it can absorb (Does teacher-refined data always improve student model performance?). So yes, compatibility filtering is a documented mechanism. A related result points the same way: student cross-encoders outperformed their LLM teachers when trained on a sufficiently large augmented set of teacher-labeled queries, because the student saw a broader input distribution, smoothed by the teacher's predictions (Can smaller models outperform their LLM teachers with enough data?). The common thread is that 'good data' is relative to the learner, not absolute.
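The filtering idea above can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: it assumes the student's "statistical profile" is approximated by its mean per-token log-probability, that these scores are precomputed, and that a refinement is rejected when the student finds it much less likely than the original. The `Candidate` type, `select_compatible`, and the `max_drop` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    original: str
    refined: str
    # Mean per-token log-probability under the student (assumed precomputed).
    student_logp_original: float
    student_logp_refined: float


def select_compatible(cands: List[Candidate], max_drop: float = 0.5) -> List[str]:
    """Keep a teacher refinement only if the student finds it at most
    `max_drop` nats per token less likely than the original text;
    otherwise fall back to the original."""
    kept = []
    for c in cands:
        drop = c.student_logp_original - c.student_logp_refined
        kept.append(c.refined if drop <= max_drop else c.original)
    return kept


cands = [
    Candidate("draft A", "refined A", -1.0, -1.2),  # small drop: keep refinement
    Candidate("draft B", "refined B", -1.0, -2.5),  # large drop: keep original
]
print(select_compatible(cands))
```

The threshold is the design choice that matters here: set too loosely, the filter admits data past the student's frontier, and set too tightly, it rejects every improvement.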
But the corpus reframes what 'drift' even is, and that's where the interesting tension lives. One note locates the real problem not in *quality* but in *distribution mismatch*: training self-correction on offline correction traces fails because the errors the model practices on aren't the errors it actually makes. The fix there isn't filtering; it's generating the data online, under the model's own error distribution (Why does self-correction training on offline data fail?). Read together, these say compatibility filtering and on-policy generation are two answers to the same question ('is this data on-distribution for *this* model?'), one by selection and one by sourcing.
There's also a kind of drift that filtering alone can't touch: collapse of diversity. Self-training tends to narrow the tail. RL converges on a single dominant output format within the first epoch and suppresses the alternatives (Does RL training collapse format diversity in pretrained models?), and self-training iterations converge prematurely unless something actively maintains spread. Here the lever isn't a compatibility filter but an injected counter-force: step-level critique models that preserve exploration diversity *during* training, not just at test time (Do critique models improve diversity during training itself?). So filtering keeps incompatible data out, but it can't put diversity back in once a feedback loop has eaten it.
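One way to see the narrowing described above is to track the entropy of output formats across training iterations. A minimal sketch, assuming each sampled output has already been tagged with a format label; the labels and the `format_entropy` helper are hypothetical and are not taken from the cited notes. This is a diagnostic only: it detects collapse but does nothing to counteract it, which is the critic's job.

```python
import math
from collections import Counter
from typing import Sequence


def format_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical format distribution.
    Zero means every sampled output uses the same format."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical format labels for rollouts sampled at two checkpoints.
early = ["prose", "code", "prose", "code"]
late = ["code", "code", "code", "code"]
print(format_entropy(early), format_entropy(late))
```

A steady fall toward zero between checkpoints is the signature of the format collapse the notes describe.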
Worth noting that filtering itself can be cheap and dual-purpose when it rides on a signal the model already produces. Cross-rollout variance, for instance, is used at two levels: at the token level it weights the dense reward, and at the query level it filters out degenerate comparisons, which the note credits with 2–3× faster and more stable training on hard-to-verify tasks (Can one statistical measure serve dual purposes in RL training?). That's the spirit of compatibility filtering generalized: use a self-supervised statistic to decide what's worth training on. Another note shows models internalizing this judgment entirely, learning to self-evaluate in unused post-output sequence space at zero inference cost (Can models learn to evaluate their own work during training?).
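The two-level use of a single statistic can be sketched as follows. This is an illustration under stated assumptions, not the DRO implementation: it assumes `scores[k][t]` is some per-token score for token `t` under rollout `k`, uses population variance across rollouts as the shared statistic, and drops a query when its mean variance falls below a hypothetical `min_query_var` threshold. The function names are invented for the example.

```python
from statistics import pvariance
from typing import List, Tuple


def token_variances(scores: List[List[float]]) -> List[float]:
    """Variance of each token's score across rollouts (columns of `scores`)."""
    return [pvariance(col) for col in zip(*scores)]


def weights_and_keep(
    scores: List[List[float]], min_query_var: float = 1e-3
) -> Tuple[List[float], bool]:
    """Token level: normalize variances into reward weights.
    Query level: keep the query only if its mean variance is large enough."""
    var = token_variances(scores)
    if not var:
        return [], False
    total = sum(var)
    keep = (total / len(var)) >= min_query_var
    weights = [v / total for v in var] if total > 0 else [0.0] * len(var)
    return weights, keep


# Two rollouts, two tokens: the rollouts disagree only on the second token.
informative = [[0.0, 1.0], [0.0, 3.0]]
# Identical rollouts: no signal to compare, so the query is discarded.
degenerate = [[1.0, 2.0], [1.0, 2.0]]
print(weights_and_keep(informative))
print(weights_and_keep(degenerate))
```

Because both decisions come from the same variance computation, the query filter adds essentially no cost beyond what the token weighting already pays.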
The honest synthesis: compatibility filtering genuinely prevents one drift mode — absorbing data past your frontier — and the corpus backs it well. But it's one tool among three. For mismatch drift, the answer is on-policy data; for diversity-collapse drift, the answer is an active critic. If you came expecting a single switch, the more useful takeaway is that 'self-training drift' is at least three different failures wearing one name, and they each want a different fix.
Sources (7 notes)
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.