Can intentional data-mixture design replace model scaling for rare task learning?

This explores whether you can teach a model rare or underrepresented tasks by carefully composing what it trains on (the mix, the order, the framing) instead of just buying capability with parameters. The corpus's most direct answer reframes what scaling even *does*: bigger models aren't better at rare tasks because they can represent solutions small models can't. They're better because the extra capacity weakens the gradients from common tasks, so frequent examples stop overwriting the slowly accumulating features that rare tasks depend on (see 'Why do larger models learn rare tasks better?'). If the real bottleneck is *interference* rather than expressivity, then scaling is just an expensive way to buy room, and the same protection might be engineered directly by controlling which examples compete for gradient at which moment. That's the opening the question is pointing at.

Several notes show that opening is real. Ordering training data by rarity, meaning fine-tuning on rare examples first because rarity signals where the model is furthest from its pretraining distribution, beats the standard easy-to-hard curriculum (see 'Does ordering training data by rarity actually improve language models?'). This reframes curriculum learning entirely: the goal isn't pedagogical scaffolding, it's managing distance from the pretraining distribution, which is exactly a data-mixture problem. Sequencing matters for a mechanical reason, too: structured tasks drive output entropy down while open-ended ones drive it up, and training the structured tasks first protects creative capabilities from entropy collapse, a 6.2% gain over joint training on everything at once (see 'Does training order reshape how models handle different task types?'). So 'data-mixture design' isn't just *what* you include; it's *when*, and the when is doing work scaling can't.
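To make the rarity-first idea concrete, here is a minimal sketch of ordering a fine-tuning set so the rarest task labels come first. It is not the CTFT method itself: the notes don't say how rarity is measured there, so this assumes rarity can be approximated by the negative log of a task label's relative frequency. The function name `rarity_first_order` and the example task labels are hypothetical.

```python
import math
from collections import Counter


def rarity_first_order(examples):
    """Return examples sorted so the rarest task labels come first.

    Assumption: rarity is approximated as -log(relative frequency of the
    task label). A real pipeline might instead score each example by its
    distance from the pretraining distribution.
    """
    counts = Counter(task for task, _ in examples)
    total = len(examples)

    def rarity(example):
        task, _ = example
        return -math.log(counts[task] / total)

    # sorted() is stable (also with reverse=True), so examples that share
    # a task label keep their original relative order.
    return sorted(examples, key=rarity, reverse=True)


# Hypothetical (task_label, input_text) pairs.
examples = [
    ("summarize", "doc 1"),
    ("summarize", "doc 2"),
    ("summarize", "doc 3"),
    ("sql_repair", "query 1"),
    ("summarize", "doc 4"),
    ("unit_convert", "3 ft to m"),
    ("sql_repair", "query 2"),
]

ordered = rarity_first_order(examples)
for task, text in ordered:
    print(task, "|", text)
```

The single `unit_convert` example is scheduled first, then the two `sql_repair` examples, then the common `summarize` examples. The same pattern could express the structured-before-creative schedule by swapping the sort key for a task-type priority.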

There's a sharper cut from the function-calling work: decomposing one umbrella skill into seven explicit subtasks and training across them generalizes better than a bigger undifferentiated dataset, closing the gap with far larger frontier models (see 'Can breaking function calling into subtasks improve model generalization?'). And data can overtake scale outright: student cross-encoders trained on enough augmented teacher-labeled data outperformed the very LLM teachers that labeled them, because broader input-distribution exposure beat raw teacher capacity (see 'Can smaller models outperform their LLM teachers with enough data?'). Pair that with the finding that tiny models with deep-thin architectures beat balanced ones at the same parameter count (see 'Does depth matter more than width for tiny language models?'), and you get a consistent theme: where capability comes from is more designable than the scaling-laws story implies.

Here's the thing you might not have come looking for: a lot of what fine-tuning teaches isn't task understanding at all; it's the *shape* of the output. Models trained on semantically empty or deliberately wrong instructions perform almost identically to correctly trained ones, because what actually transfers is knowledge of the output space (see 'Does instruction tuning teach task understanding or output format?'). If much of 'learning a task' is really learning a format distribution, then mixture design (making sure the rare output shapes are present and protected from being drowned out) is precisely the lever, and scaling is a blunt substitute for it. The same logic shows up at the extreme: decompose a hard problem finely enough and small non-reasoning models handle million-step tasks error-free, inverting the assumption that hard problems need big models (see 'Can extreme task decomposition enable reliable execution at million-step scale?').

The honest boundary: nothing here claims mixture design fully *replaces* scale across the board — these are targeted demonstrations on rare-task and specialized settings, not a general law. But the collective weight points one way. Scaling and data design often buy the same thing — protection of rare features from interference — and when you can engineer that protection directly through ordering, decomposition, rarity-weighting, and output-space coverage, the cheaper lever frequently wins. The frontier the corpus is gesturing at is less 'bigger model' and more 'better-composed diet.'
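One way to picture rarity-weighting as a mixture lever is exponent-scaled sampling, where each task's sampling probability is proportional to its example count raised to a power `alpha`. This is a commonly used reweighting heuristic brought in here for illustration, not a method taken from the notes above; the function name, task names, and counts are all hypothetical.

```python
def mixture_weights(task_counts, alpha=0.5):
    """Sampling probability per task, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw data proportions; alpha = 0.0 gives a
    uniform mixture; values in between shift probability mass toward rare
    tasks. The right alpha is an empirical question, not settled here.
    """
    scaled = {task: count ** alpha for task, count in task_counts.items()}
    norm = sum(scaled.values())
    return {task: value / norm for task, value in scaled.items()}


# Hypothetical example counts per task.
counts = {"summarize": 9000, "sql_repair": 900, "unit_convert": 100}

proportional = mixture_weights(counts, alpha=1.0)
flattened = mixture_weights(counts, alpha=0.5)

for task in counts:
    print(f"{task}: {proportional[task]:.3f} -> {flattened[task]:.3f}")
```

With `alpha=0.5` the rare `unit_convert` task is sampled several times more often than its raw share of the data, which means fewer consecutive common-task updates between the rare examples. Whether that reduces interference in a given setup would need to be measured.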

Sources (8 notes)

Why do larger models learn rare tasks better?

Larger models succeed at rare tasks not because they can represent solutions smaller models cannot, but because abundant capacity weakens gradients on common tasks, preventing them from overwriting slowly-accumulating rare-task features. Data-mixture design may be cheaper than scaling.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% against a random baseline's 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
