Does ordering training data by rarity actually improve language models?
Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.
Curriculum Textual Frequency Training (CTFT) is the third leg of the Adam's Law framework, and it inverts the usual direction of curriculum learning. Standard curriculum learning sorts examples easy-to-hard along a conceptual difficulty axis: simple arithmetic before multi-step proofs, short translations before long ones. CTFT instead sorts examples by sentence-level corpus frequency and feeds the model the rare sentences first and the common sentences last. Rare comes first because rare is where the model's prior is weak; saving the dense, well-modeled region for the end stabilizes the training trajectory.
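The rare-first ordering can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes sentence frequency can be approximated by the mean smoothed unigram log-probability of a sentence's tokens against a reference corpus. The names `sentence_frequency` and `ctft_order`, the whitespace tokenizer, and the add-one smoothing are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def sentence_frequency(sentence, unigram_counts, total_tokens):
    """Mean add-one-smoothed unigram log-probability of the sentence's tokens.

    Assumption: this is only a stand-in for sentence-level corpus frequency.
    Higher (less negative) values mean the sentence looks more common.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return float("-inf")  # treat empty input as maximally rare
    # Reserve one extra slot in the denominator for unseen tokens.
    denom = total_tokens + len(unigram_counts) + 1
    log_probs = [math.log((unigram_counts.get(t, 0) + 1) / denom) for t in tokens]
    return sum(log_probs) / len(log_probs)


def ctft_order(examples, reference_corpus):
    """Return examples sorted rare-first (lowest frequency score first)."""
    counts = Counter(t for s in reference_corpus for t in s.lower().split())
    total = sum(counts.values())
    return sorted(examples, key=lambda s: sentence_frequency(s, counts, total))


# Toy data, invented purely for illustration.
reference_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]
examples = ["the cat sat", "quokkas brachiate seldom", "the dog saw the mat"]
ordered = ctft_order(examples, reference_corpus)
print(ordered)
```

On this toy corpus the sentence made entirely of unseen tokens sorts first and the most corpus-like sentence sorts last, which is the rare-to-common direction described above. A real setup would replace the unigram proxy with whatever frequency estimate is actually available.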
The reframe matters more than the technique. For an LLM, "easy" and "hard" are not properties of the concept being expressed; they are properties of the distance from the pre-training distribution. A formally simple sentence in a rare register can be harder for the model than a complex sentence in a textbook register. This connects to the note "Does gradually tightening token budgets beat fixed budget training?": both findings argue that curriculum design for LLMs is fundamentally about managing distributional pressure, not pedagogical scaffolding. It also extends the note "Does training data format shape reasoning strategy more than domain?": format and frequency are both statistical-position properties that drive learning more than the semantic content of the examples.
The methodological lesson generalizes beyond CTFT itself. Any curriculum-design choice for LLMs that uses the human-facing "easy/hard" gloss without checking distributional position is partly mis-specified. The replacement frame is "near/far from prior" — the model finds near-prior examples easy not because they are simple but because they are dense, and far-prior examples hard not because they are complex but because they are sparse. CTFT's contribution is operationalizing that frame into a concrete sentence-frequency ordering, with story-completion distillation (TFD) as the closed-source workaround for estimating frequencies on models whose training data we cannot see directly.
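One way to check a human-facing curriculum against distributional position is to count how often the two orderings disagree. The sketch below is an assumption-laden illustration: `token_count` stands in for a human "difficulty" gloss, `rarity` stands in for distance from the prior (here, the share of tokens outside a small hand-picked common-word set), and `disagreement_rate` is a hypothetical helper, not something taken from the source.

```python
from itertools import combinations

# Hypothetical "common" vocabulary, standing in for the dense region of the prior.
COMMON_WORDS = {
    "the", "report", "shows", "that", "sales", "rose", "in", "third", "quarter",
}


def token_count(sentence):
    """Human-facing difficulty proxy (assumption): longer means harder."""
    return len(sentence.split())


def rarity(sentence):
    """Distance-from-prior proxy (assumption): share of tokens not in COMMON_WORDS."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(t not in COMMON_WORDS for t in tokens) / len(tokens)


def disagreement_rate(examples, difficulty_fn, rarity_fn):
    """Fraction of comparable pairs that the two scores rank in opposite order."""
    discordant = 0
    comparable = 0
    for a, b in combinations(examples, 2):
        d = difficulty_fn(a) - difficulty_fn(b)
        r = rarity_fn(a) - rarity_fn(b)
        if d == 0 or r == 0:
            continue  # ties carry no ordering information
        comparable += 1
        if (d > 0) != (r > 0):
            discordant += 1
    return discordant / comparable if comparable else 0.0


# Toy sentences, invented for illustration only.
examples = [
    "thou art wan",  # short, but in a rare register
    "the report shows that sales rose in the third quarter",  # long, but common
    "sales rose",  # short and common
]
rate = disagreement_rate(examples, token_count, rarity)
print(rate)
```

A high disagreement rate suggests the "easy/hard" ordering is tracking something other than distributional position, which is the mis-specification described above. The short archaic sentence and the long report sentence form exactly such a discordant pair.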
Inquiring lines that use this note as a source 17
This note is a source for these synthesized inquiries.
- How do rare linguistic registers differ from conceptually complex examples?
- Why do rare complex structures in training data harm LLM generalization?
- Why do weaker models generate better training data than stronger models?
- Why does training order matter across different domain types?
- Why do older datasets show higher LLM performance than newer ones?
- How does distributional shift toward rare inputs change memorization reliance?
- How does consolidation schedule order affect final memory quality?
- Why does the order of training examples matter for what models learn?
- How do changes in human and AI writing distributions shift rarity measures over time?
- Does statistical rarity actually correlate with originality that law should protect?
- Why does curriculum order matter when information theory says data order is irrelevant?
- Does sparsity-guided ordering work equally well for reasoning and classification tasks?
- How does the pretraining distribution shape what LLMs find hard?
- Why do frequent words rank higher in taxonomic abstraction hierarchies?
- How does training order affect knowledge acquisition in language models?
- Can intentional data-mixture design replace model scaling for rare task learning?
- Why are rare tokens the hooks for verbatim model memorization?
Related concepts in this collection 2
- Does gradually tightening token budgets beat fixed budget training?
  Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
  curriculum design as distributional pressure management
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  format and frequency both override domain content
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Adam's Law: Textual Frequency Law on Large Language Models
- Premise Order Matters in Reasoning with Large Language Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Exploring Format Consistency for Instruction Tuning
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Long-context LLMs Struggle with Long In-context Learning
Original note title
curriculum textual frequency training reverses easy-to-hard intuition by ordering data low-to-high frequency