How does training order affect knowledge acquisition in language models?

This note explores whether the *sequence* in which a model encounters training data, not just what data it sees, changes what it ends up knowing. The collected notes suggest that order matters in ways that overturn the obvious intuition.

The most direct challenge is to the classic "easy-to-hard" curriculum idea. One line of work argues the useful axis isn't conceptual difficulty at all but *distributional rarity*: fine-tuning on rare data first, because rarity signals where the model's pre-training distribution is weakest, beats standard curricula (see "Does ordering training data by rarity actually improve language models?"). Order is reframed as managing distance from what the model already absorbed, not pedagogical scaffolding. A parallel finding in multi-task RL shows order acts almost mechanically on the model's internals: structured domains drive output entropy down while creative ones drive it up, so training structured tasks *first* protects open-ended capabilities from entropy collapse, a 6.2% gain over joint training purely from sequencing (see "Does training order reshape how models handle different task types?").
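The rarity-first idea can be illustrated with a minimal sketch. It assumes rarity is approximated by the mean negative log-probability of a document's tokens under a smoothed unigram model fit on a reference corpus that stands in for the pre-training distribution. The function names, the whitespace tokenizer, and the scoring rule are all hypothetical choices made for illustration; the notes do not say how CTFT measures rarity.

```python
import math
from collections import Counter


def unigram_logprobs(reference_docs, smoothing=1.0):
    """Fit add-k smoothed unigram log-probabilities on a reference corpus.

    Assumption: the reference corpus is a stand-in for the pre-training
    distribution, and whitespace splitting is a stand-in for a tokenizer.
    """
    counts = Counter(tok for doc in reference_docs for tok in doc.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    denom = total + smoothing * vocab_size

    def logprob(token):
        return math.log((counts.get(token, 0) + smoothing) / denom)

    return logprob


def rarity_score(doc, logprob):
    """Mean negative log-probability per token; higher means rarer."""
    tokens = doc.split()
    if not tokens:
        return 0.0
    return -sum(logprob(t) for t in tokens) / len(tokens)


def rare_first(train_docs, reference_docs):
    """Return the training documents sorted from rarest to most common."""
    logprob = unigram_logprobs(reference_docs)
    return sorted(train_docs, key=lambda d: rarity_score(d, logprob), reverse=True)


reference = ["the cat sat on the mat", "the dog sat on the rug"]
train = ["the cat sat", "quantum flux capacitor", "the dog on the mat"]
ordered = rare_first(train, reference)  # fine-tune in this order
```

In this toy data the document made entirely of unseen tokens sorts to the front. A real setup would more plausibly score rarity with the model being fine-tuned rather than a unigram proxy.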

What's surprising is that order can shape *how* knowledge is represented, not only how well it sticks. Pre-pretraining on hierarchical formal languages before natural text cuts the natural-language tokens needed by a third, and the attention heads built during that early phase stay load-bearing for syntax later: an early ordering decision leaves a lasting structural fingerprint (see "Can formal language pretraining make language models more efficient?"). Even reasoning itself can be installed at a particular point in the pipeline: looped pretraining bakes iterative computation into latent space during pretraining rather than bolting it on afterward (see "Can reasoning happen in latent space during pretraining?").
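To make "hierarchical formal language" concrete, here is a small sketch that samples strings from a Dyck language (balanced brackets of several types), a standard example of nested structure. Choosing Dyck is an assumption for illustration: the notes say only "hierarchical formal languages" and do not name the language used. `sample_dyck` and `is_balanced` are hypothetical helper names.

```python
import random

# Three bracket types; each opener must be closed by its own closer.
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]


def sample_dyck(n_pairs, rng, p_open=0.5):
    """Sample a balanced bracket string with exactly n_pairs matched pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        if can_open and (not stack or rng.random() < p_open):
            left, right = rng.choice(BRACKETS)
            out.append(left)
            stack.append(right)  # remember which closer is owed
            opened += 1
        else:
            out.append(stack.pop())  # close the most recently opened bracket
    return "".join(out)


def is_balanced(s):
    """Check that every bracket is closed by the matching type, in nested order."""
    closer_for = {left: right for left, right in BRACKETS}
    stack = []
    for ch in s:
        if ch in closer_for:
            stack.append(closer_for[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack


rng = random.Random(0)
formal_corpus = [sample_dyck(8, rng) for _ in range(5)]
```

Under the ordering described above, a corpus like `formal_corpus` would be consumed in a first training phase and the natural-language data in a second; the training loop itself is outside the scope of this sketch.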

Order also interacts with forgetting in a way that contradicts the standard "new training overwrites old" story. Networks trained on cyclically repeated documents show *anticipatory recovery*: they restore performance on a document before they re-encounter it. The effect emerges and strengthens with model scale, suggesting models learn the rhythm of a training schedule, not just its content (see "Do networks recover from forgetting before re-encountering documents?"). The same structural sensitivity shows up in-context: sequential-decision learning needs whole trajectories from the same setting clustered together ("burstiness"), not scattered examples (see "Why do trajectories matter more than individual examples for in-context learning?"). Sequence is a learnable signal across these timescales.
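One way to look for anticipatory recovery is to group evaluation losses by how many training steps remain until each document comes around again. The sketch below builds a cyclic schedule and that grouping. It trains no model: `eval_loss` is a hypothetical table of per-document losses that a real experiment would log after every step, and all function names are illustrative rather than taken from the cited work.

```python
from collections import defaultdict


def cyclic_schedule(n_docs, n_cycles):
    """Yield (step, doc_id) with documents visited in a fixed repeating order."""
    for step in range(n_docs * n_cycles):
        yield step, step % n_docs


def steps_until_revisit(step, doc_id, n_docs):
    """Training steps after `step` until `doc_id` is trained on again (1..n_docs)."""
    current = step % n_docs
    return (doc_id - current - 1) % n_docs + 1


def recovery_curve(eval_loss, n_docs):
    """Average loss grouped by steps remaining until the document is revisited.

    eval_loss[step][doc_id] is assumed to be the loss on doc_id measured
    right after the training update at `step`.
    """
    buckets = defaultdict(list)
    for step, row in enumerate(eval_loss):
        for doc_id, loss in enumerate(row):
            buckets[steps_until_revisit(step, doc_id, n_docs)].append(loss)
    return {gap: sum(v) / len(v) for gap, v in sorted(buckets.items())}


schedule = list(cyclic_schedule(n_docs=3, n_cycles=2))
```

If forgetting were monotonic, the curve would be highest at a gap of 1 (the document seen longest ago). Anticipatory recovery would instead show the loss dipping again as the gap shrinks toward 1.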

The less obvious point: order governs *acquisition*, and what gets acquired then sets a hard ceiling that nothing downstream can move. Whatever entered during training dominates afterward. Strong parametric priors override information sitting right there in the context window (see "Why do language models ignore information in their context?"), and prompting can only reactivate knowledge that training already deposited, never inject what was missing (see "Can prompt optimization teach models knowledge they lack?"). That's why systems increasingly learn *when* to reach outside their weights instead of relying on what training order left behind (see "When should language models retrieve external knowledge versus use internal knowledge?"). Training order decides what becomes the immovable prior, and everything downstream is a negotiation with it.

Sources (9 notes)

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can formal language pretraining make language models more efficient?

Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.

Can reasoning happen in latent space during pretraining?

Ouro models achieve 2–3× efficiency gains by performing iterative reasoning in latent space during pretraining, not through extra capacity. Their intermediate predictions align faithfully with final outputs, making latent traces more honest than explicit chain-of-thought reasoning.

Do networks recover from forgetting before re-encountering documents?

Language models finetuned on cyclically repeated documents exhibit anticipatory recovery—restoring performance on a document before encountering it again—a phenomenon that emerges and strengthens with model scale, contradicting monotonic catastrophic interference.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models retrieval-augmented reasoning as a Markov Decision Process in which the model learns, at each step, when to retrieve versus rely on parametric knowledge. The reported 21.99% improvement in answer accuracy comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
