Can training order and structure shape what networks retain and learn?

This explores whether *when* and *in what arrangement* a network sees its training material — not just what the material is — changes what it ends up able to do and remember, and the corpus says yes, in surprisingly mechanical ways.

The most striking evidence is that sequencing produces effects you'd never predict from the data alone. When a model is finetuned on documents repeated in a fixed cycle, it starts to *anticipate* the repetition: performance on a document recovers right before that document comes around again. This recovery emerges and strengthens as models scale, and it directly contradicts the old story of monotonic catastrophic interference (see "Do networks recover from forgetting before re-encountering documents?"). Structure alone, here, buys you memory you didn't explicitly train for.
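To make "recovers right before it comes around again" concrete, here is a minimal sketch of how the effect could be quantified from a loss trace. The function name `recovery_score`, the normalization, and the toy loss values are all assumptions made for illustration; they are not the metric or data from the underlying paper.

```python
def recovery_score(probe_losses):
    """Fraction of forgotten performance regained before the document repeats.

    probe_losses[0]  : loss on the probe document right after training on it
    probe_losses[-1] : loss on it just before it is trained on again
    Values in between are evaluations taken while other documents are trained.
    This normalization is an illustrative choice, not a published definition.
    """
    if len(probe_losses) < 3:
        raise ValueError("need at least three evaluations within the cycle")
    after_training = probe_losses[0]
    peak = max(probe_losses)
    before_revisit = probe_losses[-1]
    forgotten = peak - after_training
    if forgotten <= 0:
        return 0.0  # nothing was forgotten, so there is nothing to recover
    return (peak - before_revisit) / forgotten


# Toy traces (made up): loss on one document across a single cycle.
monotonic = [1.0, 1.6, 1.9, 2.0, 2.0]      # forgetting only, no recovery
anticipatory = [1.0, 1.8, 2.0, 1.7, 1.25]  # loss falls before the revisit

print(recovery_score(monotonic))     # 0.0
print(recovery_score(anticipatory))  # 0.75
```

A score near zero matches the monotonic-interference picture; a score well above zero means the model regained ground on a document it had not yet seen again in the current cycle.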

Order matters most clearly when task *types* interact. Training structured, verifiable tasks before open-ended creative ones beats joint training by 6.2%, because structured domains drive output entropy *down* while creative ones push it *up*. Schedule them wrong and the entropy collapse from the structured phase quietly damages your open-ended capability (see "Does training order reshape how models handle different task types?"). The same principle scales down to the prompt: ordering few-shot examples from sparse (harder) to dense (easier), as measured by last-layer activation sparsity, improves in-context performance with no external difficulty labels at all (see "Can representation sparsity order few-shot demonstrations effectively?"). Curriculum, it turns out, is not just a training-time concern; it operates at inference time too.
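The prompt-level ordering can be sketched in a few lines. This is a minimal illustration, assuming sparsity is measured as the fraction of near-zero entries in a last-layer activation vector; the threshold `eps`, the function names, and the toy activations are hypothetical and not taken from the underlying work.

```python
def activation_sparsity(activations, eps=1e-3):
    """Fraction of activations whose magnitude is below eps (assumed definition)."""
    if not activations:
        raise ValueError("activation vector is empty")
    near_zero = sum(1 for a in activations if abs(a) < eps)
    return near_zero / len(activations)


def order_demonstrations(demos, eps=1e-3):
    """Order (text, activations) pairs from sparsest (harder) to densest (easier)."""
    ranked = sorted(demos, key=lambda d: activation_sparsity(d[1], eps), reverse=True)
    return [text for text, _ in ranked]


# Toy demonstrations with made-up activation vectors.
demos = [
    ("easy example", [0.9, 0.4, 0.7, 0.2]),    # sparsity 0.0
    ("hard example", [0.0, 0.0, 0.0, 0.5]),    # sparsity 0.75
    ("medium example", [0.0, 0.3, 0.0, 0.6]),  # sparsity 0.5
]

ordered = order_demonstrations(demos)
print(ordered)  # ['hard example', 'medium example', 'easy example']
```

In practice the activation vectors would come from a forward pass over each demonstration; no difficulty labels enter the ranking, which is the point of the approach.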

Learning also unfolds in *phases* that you can't reorder but can exploit. Across eight models, RL training reliably moves through two stages: first nailing execution correctness, then hitting a planning bottleneck where strategy becomes the limit. Concentrating optimization on the planning tokens during that second phase yields significant gains (see "Does RL training follow a predictable two-phase learning sequence?"). Meanwhile RL has a darker structural side effect: within the first epoch it amplifies a single dominant output format inherited from pretraining and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). So the *process* doesn't just add skills; it prunes the space of what the model will express.
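One way to watch for the second phase is to track next-token entropy separately for planning and execution tokens; the underlying note reports that planning-token entropy rises while execution entropy stabilizes. The sketch below shows only the bookkeeping. How tokens get labeled as `"planning"` or `"execution"` is an assumption left outside the example, and the probabilities are made up.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(steps):
    """Average entropy per token role.

    steps: list of (role, next_token_probs) pairs, where role is a label such
    as "planning" or "execution" (hypothetical labels for this sketch).
    """
    totals, counts = {}, {}
    for role, probs in steps:
        totals[role] = totals.get(role, 0.0) + shannon_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


# Toy generation trace with invented distributions.
steps = [
    ("planning", [0.25, 0.25, 0.25, 0.25]),  # model is unsure which strategy to take
    ("planning", [0.5, 0.5]),
    ("execution", [1.0, 0.0]),               # model is certain how to carry a step out
    ("execution", [0.5, 0.5]),
]

by_role = mean_entropy_by_role(steps)
print(by_role)
```

Logging these two averages per checkpoint would show whether the planning curve separates from the execution curve over training, which is the signature the two-phase account predicts.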

The flip side is retention: how do you learn new things without erasing old ones? The corpus converges on a clear lever: stay close to your starting point. FST-trained models stay up to 70% closer to the base distribution than parameter-only RL (lower KL drift) and, crucially, preserve *plasticity*, the ability to keep learning later tasks, while parameter-only methods stall when the task domain shifts (see "Does staying close to the base model preserve learning ability?"). Decoding-time proxy tuning takes this to the extreme by never touching base weights at all, closing 88-91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks, because direct fine-tuning corrupts knowledge stored in the lower layers (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). Where the change lands in the network is itself a structural choice with consequences for what survives.
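Both ideas in this paragraph reduce to simple arithmetic on next-token distributions. The sketch below assumes the commonly described proxy-tuning formulation, in which the large base model's logits are shifted by the difference between a small tuned model and its untuned counterpart, and it measures drift as KL(shifted || base). The function names, the KL direction, and the logit values are illustrative assumptions, not details confirmed by the notes above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def proxy_tuned_logits(base, tuned_small, untuned_small):
    """Shift base logits by the small models' tuned-minus-untuned difference.

    The base model's weights are never modified; only its output logits move.
    """
    return [b + (t - u) for b, t, u in zip(base, tuned_small, untuned_small)]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a three-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]
untuned_small = [0.5, 0.5, 0.0]
tuned_small = [0.5, 1.5, 0.0]  # tuning raised the second token's logit by 1.0

shifted = proxy_tuned_logits(base, tuned_small, untuned_small)
drift = kl_divergence(softmax(shifted), softmax(base))

print(shifted)  # [2.0, 2.0, 0.0]
print(drift)    # positive: the output distribution moved away from base
```

When the tuned and untuned small models agree, the shift is zero and the drift is zero, so the amount of movement away from the base distribution is governed entirely by what tuning changed in the small model.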

Underneath all of this is a quieter claim: structure isn't only imposed, it *emerges* from exposure. During pretraining, networks learn to fire densely for familiar data and stay sparse for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"), and they spontaneously carve compositional tasks into isolated, ablatable subnetworks, a modularity that pretraining makes far more consistent (see "Do neural networks naturally learn modular compositional structure?"). You can even force that structure on purpose: training with sparse weights produces disentangled, human-readable circuits (see "Can sparse weight training make neural networks interpretable by design?"). And the deepest version of the argument is that the structure of the *learning signal* changes the sample economics entirely: predicting your own latent representations recovers compositional hierarchies with a sample count that stays constant in hierarchy depth, where token-level prediction needs exponentially many (see "Why is predicting latents more sample-efficient than tokens?"). The thread tying it together, from Wide & Deep's joint training of memorization and generalization halves (see "Can one model memorize and generalize better than two?") to anticipatory recovery: what a network can hold and learn is shaped as much by the choreography of training as by the data on the page.

Sources 12 notes

Do networks recover from forgetting before re-encountering documents?

Language models finetuned on cyclically repeated documents exhibit anticipatory recovery—restoring performance on a document before encountering it again—a phenomenon that emerges and strengthens with model scale, contradicting monotonic catastrophic interference.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether training-order and training-structure effects on network learning and retention remain empirically valid or have been superseded. The question: *Does the choreography of training — sequencing, task ordering, curriculum, phase structure — genuinely shape what networks retain and learn, independent of data content?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2016–2026; treat these as perishable claims to re-test against current models and methods.
- Cyclically repeated documents trigger anticipatory recovery behavior in scaled models, restoring performance before retraining (arXiv:2403.09613, ~2024)
- Task ordering by entropy dynamics (structured→creative) outperforms joint training by 6.2%; entropy collapse from wrong sequencing damages open-ended capability (arXiv:2507.14783, ~2025)
- Few-shot example ordering (sparse-hard→dense-easy) improves in-context performance without explicit difficulty labels; curriculum operates at inference time (arXiv:2507.22887, ~2025)
- RL exhibits two-phase dynamics: procedural correctness first, then planning bottleneck; optimizing planning tokens in phase 2 yields gains (arXiv:2508.12790, ~2025)
- RL post-training converges on a single dominant pretraining format within epoch 1, suppressing alternatives; format dominance correlates with model scale, not performance (arXiv:2504.07912, ~2025)
- Low KL drift from the base model (up to 70% closer to the base distribution than parameter-only RL) preserves plasticity for continued learning; decoding-time proxy tuning beats direct fine-tuning on knowledge tasks while preserving base weights (arXiv:2410.08020, ~2024)
- Representational sparsity emerges from data familiarity; networks spontaneously modularize compositional tasks (arXiv:2603.03415, ~2026)
- Weight sparsity produces interpretable disentangled circuits (arXiv:2511.13653, ~2025); predicting own latents is exponentially more sample-efficient than token prediction (arXiv:2605.27734, ~2026)

Anchor papers (verify; mind their dates):
- arXiv:2301.10884 (Break It Down, 2023)
- arXiv:2507.14783 (Omni-Thinker, 2025)
- arXiv:2605.27734 (Learn from your own latents, 2026)
- arXiv:2605.12484 (Learning, Fast and Slow, 2026)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, assess whether newer frontier models (o3, o4 class; multimodal at scale), improved RL methods (reward modeling advances, process reward models), or new training infrastructure (distributed curricula, advanced scheduling frameworks, continual learning harnesses) have relaxed or overturned it. Separate the durable question (*Is order and structure intrinsically generative?*) from the perishable limitation (*This specific KL threshold, phase boundary, or recovery latency holds*). Cite what resolved any constraint; flag where it still appears to hold.
(2) **Surface the strongest contradicting or superseding work** from the last ~6 months. Does any paper show order/structure effects vanish under certain conditions (scale, domain, optimization regime)?
(3) **Propose 2 research questions** that assume the training-structure regime may have evolved: e.g., *Does continual-learning infrastructure (experience replay, orthogonal task initialization) eliminate the phase-boundary bottleneck?* or *Does emergent modularity under sparse training interact with multi-agent orchestration in ways that dissolve the single-format collapse?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
