How much alignment data does a language model actually need to specialize well?

This note explores how much fine-tuning data it really takes to make a base model behave well on a task. The corpus's answer is that quality and method matter far more than sheer volume.

Taken as a question about the *quantity* of alignment data needed to specialize a model, the corpus's most striking answer is "surprisingly little, if the data is good." The clearest evidence is LIMA, which fine-tuned a strong pretrained model on just 1,000 carefully curated examples and reached performance competitive with models trained on orders of magnitude more data (see "Can careful curation replace massive alignment datasets?"). The reason is conceptual, not just empirical: post-training mostly *activates capabilities the base model already has* rather than installing new ones. If alignment is surfacing latent ability, then curation beats quantity, and a thousand sharp examples can rival a million noisy ones.

That reframing (alignment as activation, not construction) also tells you *which* data is worth curating. Several notes suggest the highest-value examples are the ones that teach the model what *not* to do. Small models trained with DPO on paired correct/incorrect function-calling examples from a larger teacher beat plain supervised fine-tuning, because the explicit negative examples directly target the rigid format failures that SFT leaves untouched (see "Can small models match large models on function calling?"). So "how much data" is the wrong axis on its own; the better question is how much *contrastive signal* the data carries.
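
To make "contrastive signal" concrete, here is a minimal sketch of the per-pair DPO objective. It assumes you already have the summed log-probabilities of the correct ("chosen") and incorrect ("rejected") responses under the model being trained and under a frozen reference model. The function names, the toy numbers, and `beta=0.1` are illustrative assumptions, not details taken from the note.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss only depends on the *difference* between the two margins,
    # so the rejected example contributes signal of its own.
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)

# Hypothetical log-probabilities for one correct/incorrect function call pair.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)    # policy now prefers the correct call
print(untrained, improved)
```

The loss drops only when the policy raises the correct response relative to the incorrect one, which is the sense in which a paired example carries more signal than a lone positive example does under SFT.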

There's a catch worth knowing about, though: more fine-tuning can actively hurt you. Direct weight fine-tuning corrupts knowledge stored in a model's lower layers, while proxy-tuning at decoding time closes 88–91% of the alignment gap *and* preserves pretrained knowledge better, because it never touches the base weights (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So there is a real tension: aggressive specialization can trade away general competence. The cheapest, lightest-touch alignment isn't just convenient; it's sometimes the only way to avoid catastrophic forgetting.
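
As a rough sketch of the decoding-time idea, as proxy-tuning is usually described: take the large base model's next-token logits and add the difference between a small tuned "expert" and its untuned "anti-expert" counterpart, then sample from the shifted distribution. The base weights are only read, never updated. The function names and toy logits below are hypothetical; the corpus note does not spell out an implementation.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Next-token distribution after applying the small models' logit offset."""
    # All three models are assumed to share one vocabulary, so the
    # lists line up index by index.
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Toy three-token vocabulary; the tuned expert favours token 1.
base = [2.0, 1.0, 0.0]
expert = [0.0, 3.0, 0.0]
antiexpert = [0.0, 0.0, 0.0]
print(softmax(base))
print(proxy_tuned_probs(base, expert, antiexpert))
```

If the expert and anti-expert agree, the offset is zero and the base distribution comes through unchanged, which is one way to see why this approach leaves pretrained knowledge alone.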

The corpus also warns against mistaking more data for deeper learning. RL fine-tuning often *sharpens memorization* rather than installing reasoning: GRPO-trained models look strong in-distribution but drop sharply on out-of-distribution variants, suggesting the extra training tightened template-matching rather than teaching a procedure (see "Do fine-tuned language models actually learn optimization procedures?"). And there's a hard ceiling no amount of self-generated data can break: self-improvement is formally bounded by the generation-verification gap, so reliable gains require something external to validate them (see "What stops large language models from improving themselves?"). Piling on synthetic alignment data from the model itself runs straight into that wall.

The thing you might not have expected to learn: chasing volume can quietly homogenize your model. When 70+ models were tested on open-ended queries, they converged on near-identical answers, an "Artificial Hivemind" effect attributed to overlapping training data and shared alignment procedures (see "Do different AI models actually produce diverse outputs?"). So the case for small, curated, contrastive, light-touch alignment isn't only about efficiency. Specializing on less, more deliberately chosen data may be what keeps a model both knowledgeable and distinct.

Sources 6 notes

Can careful curation replace massive alignment datasets?

LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
