Can careful curation replace massive alignment datasets?
Does fine-tuning a strong pretrained model on 1,000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? An affirmative answer challenges assumptions about data volume in post-training.
LIMA ("Less Is More for Alignment") establishes a foundational finding: given a strong pretrained language model, remarkably strong alignment performance can be achieved by fine-tuning on just 1,000 carefully curated training examples. This is the alignment-specific instantiation of a broader principle that pretraining does the heavy lifting and post-training is primarily about activating existing capabilities.
The finding connects to a converging evidence pattern across the vault:
- Can a single training example unlock mathematical reasoning? One example activates reasoning in RLVR; 1,000 activate alignment in SFT.
- Can careful selection of 78 demos outperform massive training datasets? 78 curated trajectories for agentic behavior; the same principle applies.
- Can models improve themselves on tasks without verifiable answers? The identical count (1,000 examples) serves as the reasoning catalyst.
The consistent pattern: post-training interventions require far less data than commonly assumed, but the quality bar is high. Random data at scale underperforms curated data at small scale. This is the "Less Is More" principle: the pretrained model already contains the capabilities, and post-training teaches it when and how to deploy them, not what they are.
For alignment specifically, the implication challenges the industry's data collection approach. Massive RLHF annotation efforts with thousands of labelers may be optimizing the wrong variable. Careful curation of a small number of high-quality examples, targeting the specific behavioral patterns desired, may achieve comparable results at a fraction of the cost.
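A minimal sketch of what "careful curation" under a small budget could look like in practice. This is not LIMA's actual selection procedure: the `quality` score, the `pattern` label, the `min_quality` threshold, and the per-pattern cap are all hypothetical stand-ins for a reviewer's judgment, introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    prompt: str
    response: str
    quality: float  # hypothetical reviewer score in [0, 1]
    pattern: str    # hypothetical label for the behavior the example targets


def curate(pool, budget=1000, min_quality=0.8, max_per_pattern=None):
    """Pick up to `budget` examples, best quality first, optionally capped per pattern."""
    kept = []
    per_pattern = Counter()
    # Consider the highest-quality candidates first.
    for ex in sorted(pool, key=lambda e: e.quality, reverse=True):
        if len(kept) >= budget:
            break
        if ex.quality < min_quality:
            break  # sorted descending, so all remaining candidates are below the bar
        if max_per_pattern is not None and per_pattern[ex.pattern] >= max_per_pattern:
            continue  # this behavioral pattern is already covered
        kept.append(ex)
        per_pattern[ex.pattern] += 1
    return kept


# Toy pool; a real pool would hold many thousands of candidates.
pool = [
    Example("p1", "r1", 0.95, "refusal"),
    Example("p2", "r2", 0.90, "refusal"),
    Example("p3", "r3", 0.85, "step-by-step help"),
    Example("p4", "r4", 0.40, "step-by-step help"),
    Example("p5", "r5", 0.99, "clarifying question"),
]
selected = curate(pool, budget=3, max_per_pattern=1)
```

The cap per pattern reflects the note's point that curation targets specific behavioral patterns: a high score alone does not earn a slot if that pattern is already represented.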
Inquiring lines that use this note as a source (36)
This note is a source for these synthesized inquiries.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Why does RLHF alignment reduce the diversity of viewpoints in AI output?
- Why does even 0.1 percent poisoned training data persist through alignment?
- What quality of curated data is minimally sufficient for alignment?
- Can a single AI system optimize multiple alignment dimensions simultaneously?
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- Why does training data format matter more than domain content?
- How much alignment data does a language model actually need to specialize well?
- Why does training data format matter more than its domain content?
- Can selecting the right data subset outperform training on everything?
- Why do small training data contaminations persist through alignment for most attack types?
- Can alignment methods like DPO exploit or correct these surface feature biases?
- Why does KTO skip supervised fine-tuning while DPO cannot?
- Does removing cognitive bias from training signals accidentally break what makes alignment work?
- Can reward-guided decoding replace weight fine-tuning for personalized alignment?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- Why does post-training suppress alignment faking in some models but amplify it in others?
- Does gradient-based influence estimation identify which alignment examples actually matter most?
- What specific behavioral patterns should alignment examples target for maximum effect?
- What makes provenance infrastructure more critical than artifact quality?
- Can alignment training create systematic blind spots in threat detection systems?
- Why do alignment values become problematic as language models scale?
- What alignment procedures cause different models to share the same output distribution?
- How does upstream value embedding differ from downstream alignment patches?
- What preference data do different personalized alignment methods actually need?
- How much does pretraining quality affect the modularity of fine-tuned models?
- Do alignment benchmarks measure actual bias removal or only verbal compliance?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- Why does safety alignment break after only 10 harmful examples?
- Can weak models supervise the alignment of stronger models effectively?
- How does constitutional alignment compare to RLHF in removing human annotation costs?
- Can alignment procedures be redesigned to serve multiple preference groups?
- Does pretraining data size matter less than base model scale for finetuning?
- Can AI-assisted alignment eventually solve fairness at scale?
- Can instruction prompts reliably steer an LLM judge toward specific alignment targets?
- Can preference trees structure alignment data for domains beyond math and code?
Related concepts in this collection (5)
- Can a single training example unlock mathematical reasoning?
  Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
  extreme data efficiency for reasoning; LIMA is the alignment parallel
- Can careful selection of 78 demos outperform massive training datasets?
  Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
  curation > volume for agentic behavior
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
  same count, same principle, different domain
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  the theoretical foundation: post-training activates, it doesn't create
- Can we train better models on less data?
  Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
  LESS provides the principled mechanism for LIMA-style curation: gradient-based influence estimation can identify which alignment examples matter most, operationalizing "careful curation" as a computable selection criterion rather than manual judgment
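To make "a computable selection criterion" concrete, here is a toy sketch of scoring candidates by gradient similarity to a small target set. It is not the LESS implementation, which is considerably more involved. The linear model, squared loss, and the function names `grad`, `cosine`, and `select_by_gradient_similarity` are assumptions chosen only to show the idea on a runnable scale.

```python
import math


def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w, for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb) if na and nb else 0.0


def select_by_gradient_similarity(w, candidates, targets, k):
    """Return indices of the k candidates whose gradients best align with the targets'."""
    # Average the per-example gradients over the small target set.
    target_grads = [grad(w, x, y) for x, y in targets]
    mean_target = [sum(col) / len(target_grads) for col in zip(*target_grads)]
    # Score each candidate by how closely its gradient points the same way.
    scored = [
        (cosine(grad(w, x, y), mean_target), i)
        for i, (x, y) in enumerate(candidates)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored[:k]]


w = [0.0, 0.0]                    # current (toy) model weights
targets = [([1.0, 0.0], 1.0)]     # examples of the capability we want
candidates = [
    ([1.0, 0.0], 2.0),            # gradient aligned with the target gradient
    ([0.0, 1.0], 1.0),            # gradient orthogonal to it
    ([1.0, 0.0], -1.0),           # gradient pointing the opposite way
]
chosen = select_by_gradient_similarity(w, candidates, targets, k=1)
```

A candidate scores highly when a training step on it would move the weights in roughly the same direction as a step on the target examples, which is one way to turn "this example matters" into a number rather than a manual call.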
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Foundations of Large Language Models
- LIMA: Less Is More for Alignment
- Advancing LLM Reasoning Generalists with Preference Trees
- Model Organisms for Emergent Misalignment
- Automated Alignment Researchers: Using large language models to scale scalable oversight
- ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making
- DataComp-LM: In search of the next generation of training sets for language models
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
Original note title
1000 carefully curated alignment examples achieve remarkably strong performance — alignment is primarily about data quality not quantity