Can training order and structure shape what networks retain and learn?
This explores whether *when* and *in what arrangement* a network sees its training material — not just what the material is — changes what it ends up able to do and remember, and the corpus says yes, in surprisingly mechanical ways.
The collection's answer is a fairly emphatic yes, and the most striking evidence is that sequencing produces effects you'd never predict from the data alone. When a model is finetuned on documents repeated in a fixed cycle, it starts to *anticipate* the cycle: performance on a document recovers right before that document comes around again. This recovery behavior gets stronger as models scale and directly contradicts the old story of monotonic catastrophic interference (Do networks recover from forgetting before re-encountering documents?). Structure alone, here, buys you memory you didn't explicitly train for.
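One way to make "recovery before the document comes around again" concrete is to score a single cycle of losses for one document. The sketch below is a hypothetical metric, not the one used in the cited work: `recovery_score` and the example loss values are illustrative assumptions. It asks what fraction of the forgetting within a cycle has been undone by the time the document is about to be seen again.

```python
def recovery_score(cycle_losses):
    """Fraction of within-cycle forgetting undone before the document returns.

    cycle_losses[0]  : loss on the document right after training on it
    cycle_losses[-1] : loss on the document just before it is trained on again
    Values in between are measured while other documents are being trained.
    This definition is an assumption made for illustration.
    """
    if len(cycle_losses) < 2:
        raise ValueError("need at least two loss measurements per cycle")
    start = cycle_losses[0]
    peak = max(cycle_losses)
    end = cycle_losses[-1]
    forgotten = peak - start
    if forgotten <= 0:
        # The loss never rose above its starting value, so nothing was forgotten.
        return 0.0
    return (peak - end) / forgotten


# Made-up loss curves, not measurements.
monotonic = [1.0, 2.0, 3.0, 4.0]     # loss keeps rising until the document returns
anticipatory = [1.0, 3.0, 4.0, 2.5]  # loss falls again before the document returns

print(recovery_score(monotonic))     # 0.0
print(recovery_score(anticipatory))  # 0.5
```

Under monotonic interference the peak sits at the end of the cycle and the score is 0; any positive score means the loss turned downward before re-exposure.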
Order matters most clearly when task *types* interact. Training structured, verifiable tasks before open-ended creative ones beats joint training by 6.2% in the Omni-Thinker results, because structured domains drive output entropy *down* while creative ones push it *up*. Schedule them wrong and the entropy collapse from the structured phase quietly damages your open-ended capability (Does training order reshape how models handle different task types?). The same principle scales down to the prompt: ordering few-shot examples from sparse-and-hard to dense-and-easy improves in-context performance with no external difficulty labels at all (Can representation sparsity order few-shot demonstrations effectively?). Curriculum, it turns out, is not just a training-time concern; it operates at inference time too.
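The few-shot ordering described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the source says only that last-layer activation sparsity drives the ordering, so measuring sparsity as the fraction of near-zero entries, the threshold `eps`, and the function names are all hypothetical choices.

```python
import numpy as np


def activation_sparsity(activations, eps=1e-6):
    """Fraction of activation entries whose magnitude is at most eps (assumed definition)."""
    a = np.asarray(activations, dtype=float)
    return float(np.mean(np.abs(a) <= eps))


def order_sparse_to_dense(demos, last_layer_activations, eps=1e-6):
    """Return demos sorted from sparsest (treated as harder) to densest (treated as easier)."""
    if len(demos) != len(last_layer_activations):
        raise ValueError("one activation vector is needed per demonstration")
    scores = [activation_sparsity(a, eps) for a in last_layer_activations]
    order = sorted(range(len(demos)), key=lambda i: scores[i], reverse=True)
    return [demos[i] for i in order]


# Toy activations standing in for real last-layer outputs.
demos = ["a", "b", "c"]
acts = [
    [1.0, 2.0, 3.0, 4.0],  # sparsity 0.00
    [0.0, 0.0, 0.0, 1.0],  # sparsity 0.75
    [0.0, 0.0, 1.0, 1.0],  # sparsity 0.50
]
print(order_sparse_to_dense(demos, acts))  # ['b', 'c', 'a']
```

In practice the activation vectors would come from a forward pass of the model over each demonstration; how they are pooled across tokens is left open here.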
Learning also unfolds in *phases* that you can't reorder but can exploit. Across eight models, RL training moves through two stages: first nailing execution correctness, then hitting a planning bottleneck where strategy becomes the limit. Concentrating optimization on the planning tokens during that second phase yields significant gains (Does RL training follow a predictable two-phase learning sequence?). RL also has a darker structural side effect: within the first epoch it amplifies a single output format inherited from pretraining and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (Does RL training collapse format diversity in pretrained models?). So the *process* doesn't just add skills; it prunes the space of what the model will express.
The flip side is retention: how do you learn new things without erasing old ones? The corpus converges on a clear lever: stay close to your starting point. Models trained for low KL drift stay up to 70% closer to the base distribution than parameter-only RL and, crucially, preserve *plasticity*, the ability to keep learning later tasks, while parameter-only methods stall when the domain shifts (Does staying close to the base model preserve learning ability?). Decoding-time proxy tuning takes this to the extreme by never touching base weights at all, closing 88-91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks, because direct fine-tuning corrupts knowledge stored in the lower layers (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Where the change lands in the network is itself a structural choice with consequences for what survives.
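Both ideas in this paragraph reduce to simple arithmetic on next-token logits, sketched below. Two assumptions are worth flagging: drift is measured here as KL(tuned || base) averaged over positions (the direction and averaging are illustrative choices), and the proxy-tuning step uses the logit-offset form usually described for that method, base logits plus the difference between a small tuned "expert" and its untuned "anti-expert", which the source note does not spell out. The function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kl_drift(tuned_logits, base_logits):
    """Mean KL(tuned || base) across positions; the direction is an assumption."""
    log_p = log_softmax(tuned_logits)
    log_q = log_softmax(base_logits)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(np.mean(kl_per_position))


def proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits):
    """Shift the frozen base model's logits by the expert minus anti-expert offset.

    No base weights are modified; the adjustment happens only at decoding time.
    """
    base = np.asarray(base_logits, dtype=float)
    offset = np.asarray(expert_logits, dtype=float) - np.asarray(anti_expert_logits, dtype=float)
    return base + offset


# Toy logits with shape (positions, vocabulary); values are made up.
base = np.array([[2.0, 1.0, 0.0], [0.5, 0.5, 3.0]])
anti_expert = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
expert = anti_expert + np.array([0.0, 1.0, -1.0])  # tuning shifted mass toward token 1

steered = proxy_tuned_logits(base, expert, anti_expert)
print(kl_drift(steered, base))  # positive: the steered distribution moved away from base
print(kl_drift(base, base))     # 0.0
```

Tracking a quantity like `kl_drift` during training is one way to check whether a method is actually staying close to the starting point.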
Underneath all of this is a quieter claim: structure isn't only imposed, it *emerges* from exposure. Networks learn to fire densely for familiar data and stay sparse for unfamiliar inputs as a natural consolidation of pretraining (Is representational sparsity learned or intrinsic to neural networks?), and they spontaneously carve compositional tasks into isolated, ablatable subnetworks, a modularity that pretraining makes far more consistent (Do neural networks naturally learn modular compositional structure?). You can even force that structure on purpose: training with sparse weights produces disentangled, human-readable circuits (Can sparse weight training make neural networks interpretable by design?). And the deepest version of the argument is that the *learning signal's* structure changes the sample economics entirely: predicting your own latent representations recovers compositional hierarchies with a sample count that stays constant in hierarchy depth, where token-level prediction needs exponentially many samples (Why is predicting latents more sample-efficient than tokens?). One thread ties it together, from Wide & Deep's joint training of its memorization and generalization halves (Can one model memorize and generalize better than two?) to anticipatory recovery: what a network can hold and learn is shaped as much by the choreography of training as by the data on the page.
Sources (12 notes)
Language models finetuned on cyclically repeated documents exhibit anticipatory recovery—restoring performance on a document before encountering it again—a phenomenon that emerges and strengthens with model scale, contradicting monotonic catastrophic interference.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.