Why do student models learn better from internal pruning versus external compression?
This explores why student models trained on reasoning traces that have been pruned by the model's own internal importance signals outperform students trained on traces compressed by an external frontier model — and what the corpus says about why self-generated pruning preserves what external rewriting strips away.
The comparison is between two ways of slimming down the reasoning traces a student model learns from: pruning guided by the model's *own* internal sense of which tokens matter, versus compression imposed from the *outside* by a separate, more powerful model that rewrites the chain. The most direct evidence is that students trained on internally pruned chains outperform those trained on frontier-model compression (see "Which tokens in reasoning chains actually matter most?"). The reason is that likelihood-preserving pruning isn't blind shortening: tokens are dropped greedily, and the resulting order tracks functional role, with grammar and meta-discourse going first while the symbolic-computation tokens that actually carry the reasoning are protected. External compression has no access to that internal ranking; it optimizes for looking clean, not for keeping the load-bearing steps.
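The greedy procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from the underlying note: `greedy_prune`, `toy_answer_logprob`, and `TOY_WEIGHTS` are hypothetical names, and the toy scorer stands in for a real model's log-likelihood of the final answer given the remaining chain.

```python
from typing import Callable, List, Sequence


def greedy_prune(
    tokens: Sequence[str],
    answer_logprob: Callable[[Sequence[str]], float],
    keep: int,
) -> List[str]:
    """Drop one token at a time, always removing the token whose deletion
    leaves the highest answer score, until `keep` tokens remain."""
    kept = list(tokens)
    while len(kept) > keep:
        # Score every single-token deletion and pick the cheapest one.
        # On ties, max() returns the earliest index.
        cheapest = max(
            range(len(kept)),
            key=lambda i: answer_logprob(kept[:i] + kept[i + 1:]),
        )
        del kept[cheapest]
    return kept


# Assumption: a toy scorer in which symbolic tokens carry most of the
# score and filler tokens carry almost none. A real setup would query
# the model for the answer's log-likelihood instead.
TOY_WEIGHTS = {"3": 2.0, "*": 2.0, "4": 2.0, "=": 1.5, "12": 3.0}


def toy_answer_logprob(tokens: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS.get(tok, 0.1) for tok in tokens)


chain = ["okay", "so", "let", "me", "think", ":", "3", "*", "4", "=", "12"]
pruned = greedy_prune(chain, toy_answer_logprob, keep=5)
print(pruned)
```

With this toy scorer the filler words are removed first and the arithmetic tokens survive, which mirrors the pattern the note reports. Each round rescoring every candidate deletion is expensive, so a practical version would likely batch or approximate the scoring.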
The corpus suggests the deeper issue is *what gets lost when an outside model decides what's important.* When teachers are conditioned to produce confident, concise traces (exactly the move an external compressor makes), students inherit that confidence but lose the uncertainty signals that help them generalize beyond the training distribution (see "Does richer teacher context hurt student generalization?"). Polished external output trades out-of-distribution robustness for in-domain neatness. So 'better-looking' compression can quietly amputate the epistemic hedging a student needs to handle unfamiliar problems.
There's a wider pattern here about compression as an act that destroys nuance when it's optimized too aggressively. Models tend to compress concepts harder than humans do, capturing broad category structure while losing the fine-grained distinctions that matter in context (see "Do LLMs compress concepts more aggressively than humans do?"). An external compressor applied to a reasoning chain plausibly does much the same thing: it maximizes efficiency at the cost of situated detail. Internal pruning sidesteps the trap because it's keyed to the model's own functional priorities rather than a generic 'make it shorter' objective.
Why is the model's internal signal trustworthy in the first place? Two notes hint at an answer. Models develop dense, structured representations for material they're familiar with and fall back to sparse defaults on unfamiliar input (see "Is representational sparsity learned or intrinsic to neural networks?"), and they sparsify their activations adaptively under harder, out-of-distribution tasks as a stabilizing filter rather than a failure (see "Do language models sparsify their activations under difficult tasks?"). In other words, selective internal pruning is something these models already do well; it's a learned competence, not noise. Harnessing that same instinct to trim training traces is working *with* the grain of the model.
A less obvious connection: this ties into why staying close to a model's own distribution helps it keep learning. Low drift from the base model preserves plasticity for downstream tasks, while training approaches that drift further stall when domains shift (see "Does staying close to the base model preserve learning ability?"). Internal pruning keeps a student near the distribution it can actually learn from; external compression drags it toward a foreign frontier-model style. The lesson across all of these is the same: the most useful editor of a model's reasoning is often the model itself.
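Drift from a base model is commonly quantified as a KL divergence between next-token distributions. The sketch below shows that measurement on made-up numbers. Everything in it is an assumption for illustration: the three-token vocabulary, the distributions, the names `near_student` and `far_student`, and the choice of direction KL(student || base), which the source note does not specify.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support.

    Assumes q[i] is positive wherever p[i] is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-token vocabulary.
base = [0.5, 0.3, 0.2]
near_student = [0.45, 0.35, 0.2]  # stays close to the base model
far_student = [0.05, 0.05, 0.9]   # pulled toward a very different style

drift_near = kl_divergence(near_student, base)
drift_far = kl_divergence(far_student, base)
print(f"near: {drift_near:.4f} nats, far: {drift_far:.4f} nats")
```

In practice the divergence would be averaged over many contexts sampled from the student rather than computed for a single distribution.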
Sources (6 notes)
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.