Can selecting the right data subset outperform training on everything?

This explores whether curating a smaller, well-chosen training set can beat the brute-force approach of training on all available data — and the corpus answers with a fairly emphatic yes, while explaining *why* extra data can actively hurt.

The collection's strongest finding is that a well-chosen subset routinely beats the full dataset, sometimes dramatically. LESS uses gradient similarity to pick the 5% of instruction examples most aligned with a target capability, and training on that sliver consistently beats training on the full set (Can we train better models on less data?). LIMA pushes the same idea to the extreme: 1,000 carefully curated alignment examples on a strong base model match models trained on orders of magnitude more data, because post-training mostly *activates* capabilities the model already has rather than teaching new ones (Can careful curation replace massive alignment datasets?). And in vision, ranking examples by difficulty and pruning the redundant easy ones beats the usual power-law scaling: 50% of CIFAR-10 can be thrown away with no accuracy loss (Can we prune training data without hurting model performance?).

The more interesting question is *why* less can beat more, and here the corpus reframes 'extra data' as not neutral but actively harmful. LESS's own explanation is that mixed datasets contain examples that hinder a specific skill by nudging the model's reasoning strategy away from what the task needs (Can we train better models on less data?). In RL, overly hard samples are worse than useless: models learn degenerate shortcuts on near-impossible problems, and those shortcuts then contaminate capabilities the model already had (Do overly hard RLVR samples actually harm model capabilities?). So 'train on everything' silently smuggles in examples that drag specific skills down.

A subtler thread is that the *right* subset is relative to the learner, not absolute. Teacher-refined data that is objectively higher quality still degrades a student model when it sits beyond the student's learning frontier, so the student should filter refinements against its own statistical profile and keep only what's compatible (Does teacher-refined data always improve student model performance?). That means there's no universal 'best subset': selection has to be conditioned on the model doing the learning, which is exactly the move LESS makes with target-aware gradients.

Selection also doesn't have to be a separate preprocessing step; it can be folded into training itself. DRO reuses a single cross-rollout variance statistic both to weight tokens and to filter out degenerate queries on the fly, getting 2–3× faster training by discarding bad comparisons mid-stream (Can one statistical measure serve dual purposes in RL training?). So 'curate then train' and 'filter while training' are two faces of the same insight.

The thing you might not have expected to learn: the case for subset selection isn't really about saving compute. It's that the full dataset is a mixture of helpful, useless, and outright harmful examples, and the harm doesn't average out — it transfers into the model as forgotten skills and learned shortcuts. Curation wins not because small is efficient, but because big quietly includes its own poison.

Sources 6 notes

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
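The selection step in this note can be sketched in a few lines. This is a toy illustration rather than the LESS implementation: the feature vectors are hard-coded stand-ins for gradient features, and `cosine` and `select_top_fraction` are hypothetical helper names introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_top_fraction(train_features, target_feature, fraction=0.05):
    """Return sorted indices of the examples whose features align best
    with the target feature, keeping roughly `fraction` of the pool."""
    k = max(1, int(fraction * len(train_features)))
    scores = [cosine(f, target_feature) for f in train_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy pool of 20 two-dimensional "gradient features" (made-up values).
# Only the first example points in roughly the same direction as the target.
pool = [[1.0, 0.0]] + [[-1.0, float(i)] for i in range(1, 20)]
target = [1.0, 0.1]

selected = select_top_fraction(pool, target, fraction=0.05)
half = select_top_fraction(pool, target, fraction=0.5)
```

With a pool of 20 and a 5% budget, a single example survives. In practice the features would come from the model's own gradients on each training example and on a few target-task examples, which is what makes the selection target-aware.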

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
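As a rough sketch of difficulty-based pruning, the snippet below scores each example with EL2N, taken here as the L2 norm of the gap between predicted class probabilities and the one-hot label, and drops the easiest half. The probabilities are made up, `prune_easiest` is a hypothetical helper, and a single set of predictions stands in for what would normally be averaged over several early-training models.

```python
import math

def el2n_score(probs, label):
    """L2 norm of (predicted probabilities - one-hot label) for one example."""
    return math.sqrt(sum((p - (1.0 if c == label else 0.0)) ** 2
                         for c, p in enumerate(probs)))

def prune_easiest(all_probs, labels, prune_fraction=0.5):
    """Drop the `prune_fraction` of examples with the lowest EL2N score
    and return the sorted indices of the examples that are kept."""
    scores = [el2n_score(p, y) for p, y in zip(all_probs, labels)]
    n_prune = int(prune_fraction * len(scores))
    by_difficulty = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(by_difficulty[n_prune:])

# Hypothetical softmax outputs for four 3-class examples, all with true class 0.
probs = [
    [0.98, 0.01, 0.01],  # confident and correct: easy
    [0.40, 0.35, 0.25],  # correct but uncertain: harder
    [0.90, 0.05, 0.05],  # correct: fairly easy
    [0.10, 0.80, 0.10],  # confidently wrong: hardest
]
labels = [0, 0, 0, 0]

kept = prune_easiest(probs, labels, prune_fraction=0.5)
```

Here the two confidently correct examples are pruned and the uncertain and misclassified ones are kept, which is the "remove easy ones" direction the note describes.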

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
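A small numeric sketch shows why a rare accidental success gets amplified. It assumes the common group-relative form (reward minus the group mean, divided by the group standard deviation); the note does not spell out the formula, so treat this as an illustration. The reward lists are invented, and `group_relative_advantages` is a hypothetical helper name.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Eight rollouts per problem, reward 1 for a correct answer and 0 otherwise.
near_impossible = [1, 0, 0, 0, 0, 0, 0, 0]  # one lucky success
moderate = [1, 1, 1, 1, 0, 0, 0, 0]         # half the rollouts succeed

lucky_adv = group_relative_advantages(near_impossible)[0]
typical_adv = group_relative_advantages(moderate)[0]
```

Under this normalization the lone success on the near-impossible problem receives an advantage of about 2.6, against about 1.0 for a success on the moderate problem. Whatever produced the lucky answer, including a shortcut, is therefore reinforced much more strongly.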

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
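The dual use of one statistic can be sketched as follows. This is an assumption-laden toy, not DRO's actual formulation: it supposes each rollout yields one score per token position, takes the variance across rollouts at each position, and then uses that variance twice. The function names and the `min_mean_variance` threshold are hypothetical.

```python
import statistics

def token_variances(token_scores_per_rollout):
    """Variance of each token position's score across rollouts.
    Input: one list of per-token scores per rollout, all the same length."""
    return [statistics.pvariance(position)
            for position in zip(*token_scores_per_rollout)]

def token_weights(variances):
    """Token-level use: weight each token in proportion to its variance."""
    total = sum(variances)
    if total == 0:
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]

def keep_query(variances, min_mean_variance=1e-3):
    """Query-level use: drop queries whose rollouts barely disagree."""
    return statistics.fmean(variances) >= min_mean_variance

# Three rollouts, three token positions; only the middle token varies.
informative = [[0.9, 0.2, 0.8], [0.9, 0.7, 0.8], [0.9, 0.3, 0.8]]
# Every rollout gives identical scores, so there is nothing to compare.
degenerate = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

info_var = token_variances(informative)
weights = token_weights(info_var)
keep_informative = keep_query(info_var)
keep_degenerate = keep_query(token_variances(degenerate))
```

In the toy data all the weight lands on the one token whose score depends on the rollout, and the query with identical rollouts is filtered out, so the same variance computation serves both purposes.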

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: Can selecting the right data subset outperform training on everything? This remains open — capability gains from curation may be *model-dependent* and *task-dependent*, and we don't yet know how far the gains scale or whether they hold under continual learning.

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2026.
• LESS matches or exceeds full-dataset performance with 5% of instruction data via gradient-based influence scoring; gains are target-capability-specific (2024).
• LIMA matches models trained on orders of magnitude more data using 1,000 curated examples, suggesting post-training *activates* capabilities rather than teaching them; but this claim was validated only on alignment, not on novel reasoning (2023).
• Vision pruning removes 50% of CIFAR-10 with no accuracy loss (2022).
• Overly hard RLVR samples induce degenerate shortcuts that contaminate capabilities the model already had (2025–2026).
• Student models degrade on teacher-refined data if it exceeds their learning frontier; curation must be conditioned on the learner's profile, not universal (2024).
• In-stream filtering (DRO, cross-rollout variance) achieves 2–3× faster training by discarding harmful comparisons mid-training, suggesting selection needn't be preprocessing (2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.04333 (LESS, 2024)
• arXiv:2206.14486 (beating power-law scaling via data pruning, 2022)
• arXiv:2605.28388 (mechanistic role of sample difficulty in RLVR, 2026)
• arXiv:2605.12484 (continual adaptation, 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For LESS's 5% claim: Has this held as model scale, domain, and instruction-following maturity have grown? For LIMA's 1K figure: Do newer foundation models still plateau at 1K alignment examples, or does scale demand more? For student-learner filtering: Has this been validated on student–teacher pairs beyond alignment (e.g., code, reasoning)? Separate durable insight (harmful data exists and transfers) from perishable numbers (5%, 1K, 50%).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming *scaling* defeats curation, or arguing the gains are marginal under compute-optimal regimes, or showing subset selection fails on out-of-distribution tasks.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Does continual learning invalidate fixed-subset curation?" or "Do emergent capabilities require dense, non-curated data at specific scales?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
