Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Synthesis note · 2026-05-02 · sourced from Natural Language Inference

Curriculum Textual Frequency Training (CTFT) is the third leg of the Adam's Law framework, and it inverts the usual direction of curriculum learning. Standard curriculum learning sorts examples easy-to-hard along a conceptual difficulty axis: simple arithmetic before multi-step proofs, short translations before long ones. CTFT instead sorts examples by sentence-level corpus frequency, feeding the model the rare sentences first and the common sentences last. Rare sentences come first because that is where the model's prior is weakest; saving the dense, well-modeled region for the end stabilizes the training trajectory.
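
The rare-first ordering described above can be sketched in a few lines. This is a minimal illustration, not the method's reference implementation: the note does not say how sentence-level frequency is computed, so the sketch assumes it can be approximated by the mean log relative frequency of a sentence's words in a reference corpus, with add-one smoothing. The names `tokenize`, `sentence_log_frequency`, and `ctft_order` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def sentence_log_frequency(sentence, word_counts, total):
    """Mean log relative word frequency with add-one smoothing.

    ASSUMPTION: this is a stand-in for "sentence-level corpus frequency";
    the source does not define the exact estimator.
    """
    words = tokenize(sentence)
    if not words:
        # A sentence with no tokens is treated as maximally rare.
        return float("-inf")
    vocab = len(word_counts)
    log_probs = [
        math.log((word_counts[w] + 1) / (total + vocab)) for w in words
    ]
    return sum(log_probs) / len(log_probs)


def ctft_order(examples, reference_corpus):
    """Return examples sorted rare-first (lowest frequency score first)."""
    word_counts = Counter(w for s in reference_corpus for w in tokenize(s))
    total = sum(word_counts.values())
    return sorted(
        examples,
        key=lambda s: sentence_log_frequency(s, word_counts, total),
    )


reference_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat and the dog",
]
examples = ["the cat sat", "quixotic zephyrs vex"]
ordered = ctft_order(examples, reference_corpus)
```

Here `ordered` places the sentence built from unseen words ahead of the one built from frequent words. A standard easy-to-hard curriculum would correspond to reversing that sort on whatever difficulty score it uses; the point of the sketch is only that the sort key is distributional rather than conceptual.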

The reframe matters more than the technique. For an LLM, "easy" and "hard" are not properties of the concept being expressed; they are properties of the distance from the pre-training distribution. A formally simple sentence in a rare register can be harder for the model than a complex sentence in a textbook register. This connects to "Does gradually tightening token budgets beat fixed budget training?": both findings argue that curriculum design for LLMs is fundamentally about managing distributional pressure, not pedagogical scaffolding. It also extends "Does training data format shape reasoning strategy more than domain?": format and frequency are both statistical-position properties, and both drive learning more than the semantic content of the examples does.

The methodological lesson generalizes beyond CTFT itself. Any curriculum-design choice for LLMs that relies on the human-facing "easy/hard" gloss without checking distributional position is partly mis-specified. The replacement frame is "near/far from prior": the model finds near-prior examples easy not because they are simple but because they are dense, and far-prior examples hard not because they are complex but because they are sparse. CTFT's contribution is operationalizing that frame as a concrete sentence-frequency ordering, with story-completion distillation (TFD) as the workaround for estimating frequencies on closed-source models whose training data we cannot see directly.
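
As a rough illustration of the closed-source workaround, the sketch below builds a relative-frequency table from text a model has generated, to be used as a proxy for the unseen training distribution. The note gives no details of the TFD procedure, so everything here is an assumption: the completions are placeholder strings rather than real model output, no model API is called, and `frequency_table` is a hypothetical helper.

```python
import re
from collections import Counter


def frequency_table(generated_texts):
    """Relative word frequencies over a sample of model-generated text.

    ASSUMPTION: word counts in sampled completions approximate the
    frequencies the model absorbed in pre-training. The source does not
    specify how TFD turns completions into frequency estimates.
    """
    counts = Counter()
    for text in generated_texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Placeholder strings standing in for story completions from a model.
completions = ["Once upon a time", "a time ago"]
table = frequency_table(completions)
```

A table like this could stand in for corpus counts when scoring sentences for a rare-first ordering, with the caveat that a small sample of completions gives noisy estimates for rare words.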

Related concepts in this collection 2

Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
curriculum design as distributional pressure management
Does training data format shape reasoning strategy more than domain? What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
format and frequency both override domain content
