Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

LESS (Low-rank gradiEnt Similarity Search) selects instruction tuning data by estimating each example's influence on a target capability. Given a handful of examples embodying a specific skill (e.g., reasoning), LESS constructs a gradient datastore of low-dimensional features and selects training data whose gradient signatures are most similar to the target examples.
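The selection step described above can be sketched as follows. This is a minimal illustration, not the LESS implementation: it assumes each example has already been reduced to a small gradient feature vector, averages the target features into one vector, and scores training examples by cosine similarity to it. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def score_by_gradient_similarity(train_features, target_features):
    """Score each training example by similarity to the averaged target feature.

    Assumption: averaging the target features is one simple way to summarize
    the target capability; the source does not specify the aggregation.
    """
    target = mean_vector(target_features)
    return [cosine(f, target) for f in train_features]

# Toy 3-dimensional "gradient features" (hypothetical values for illustration).
train_features = [
    [1.0, 0.0, 0.0],   # points along the target direction
    [0.0, 1.0, 0.0],   # orthogonal to it
    [-1.0, 0.0, 0.0],  # points the opposite way
]
target_features = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]

scores = score_by_gradient_similarity(train_features, target_features)
# Indices of training examples, best match first.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy setup the example aligned with the target direction ranks first and the opposing one ranks last, which is the behaviour a similarity-based selector is meant to have.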

The headline result: training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. This is not just an efficiency gain; it is a net improvement. The proposed mechanism: mixed instruction tuning datasets contain examples that actively hinder specific capabilities. As the note "Does training data format shape reasoning strategy more than domain?" argues, examples in the wrong format can shift the model's reasoning strategy away from what the target task requires.
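To make the 5% budget concrete, the sketch below keeps the highest-scoring slice of a scored dataset. The helper name `select_top_fraction` and the synthetic scores are hypothetical; the only thing taken from the note is the 5% figure.

```python
import math

def select_top_fraction(scores, fraction=0.05):
    """Return indices of the highest-scoring examples, keeping ceil(fraction * n) of them."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Round before ceil so floating-point noise (e.g. 7.000000000000001) cannot add an extra item.
    k = math.ceil(round(fraction * len(scores), 9))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical scores: example i gets score i / 100, so later examples rank higher.
scores = [i / 100 for i in range(100)]
selected = select_top_fraction(scores, fraction=0.05)
```

With 100 scored examples this keeps 5 of them; everything below the cutoff, including any examples that would hinder the target capability, is simply left out of training.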

Three technical innovations make this practical for LLMs: (1) adaptation to the Adam optimizer (influence formulations traditionally assume SGD), (2) variable-length sequence handling (instruction data varies wildly in length, which derails standard gradient comparisons), and (3) low-rank gradient features that compress the storage and computation to feasible levels.
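The three pieces can be sketched together on toy vectors. This is an illustration under stated assumptions, not the paper's code: the Adam step uses the standard bias-corrected moment formulas with moment estimates `m` and `v` supplied by the caller, length sensitivity is handled here by normalizing features to unit length (so comparisons depend on direction only), and the low-dimensional features come from a fixed random Gaussian projection. All function names and dimensions are hypothetical.

```python
import math
import random

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update direction for one gradient, given prior moment estimates m and v."""
    out = []
    for g, m_i, v_i in zip(grad, m, v):
        m_new = beta1 * m_i + (1 - beta1) * g
        v_new = beta2 * v_i + (1 - beta2) * g * g
        m_hat = m_new / (1 - beta1 ** step)  # bias-corrected first moment
        v_hat = v_new / (1 - beta2 ** step)  # bias-corrected second moment
        out.append(m_hat / (math.sqrt(v_hat) + eps))
    return out

def make_projection(in_dim, out_dim, seed=0):
    """Fixed random Gaussian matrix (out_dim x in_dim) used to compress vectors."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, vec):
    """Matrix-vector product: maps an in_dim vector to an out_dim feature."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def normalize(vec):
    """Scale to unit length so later comparisons ignore gradient magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)

def gradient_feature(grad, m, v, step, matrix):
    """Adam direction -> random projection -> unit-length feature."""
    return normalize(project(matrix, adam_direction(grad, m, v, step)))

# Toy usage: a 16-dimensional gradient compressed to a 4-dimensional feature.
in_dim, out_dim = 16, 4
matrix = make_projection(in_dim, out_dim, seed=0)
grad = [0.5 * (i + 1) * (-1) ** i for i in range(in_dim)]  # hypothetical nonzero gradient
zeros = [0.0] * in_dim
feature = gradient_feature(grad, zeros, zeros, step=1, matrix=matrix)
```

In practice the moment estimates would come from the optimizer state at the point where gradients are taken; zeros are used here only to keep the example self-contained. Because every stored feature has `out_dim` entries rather than one per model parameter, the datastore stays small enough to compare every training example against the targets.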

The transferability finding is striking: smaller models can select useful data for larger models, and models from different families can share data selections. This suggests the gradient-based quality signal captures something about the data's intrinsic fit with a capability, not just its fit with a particular model's current state. The qualitative analysis supports this: LESS selects data that "goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills."

This connects to the broader pattern that data quality dominates data quantity. The note "Can models improve themselves on tasks without verifiable answers?" showed that 1000 well-chosen examples can catalyze general self-improvement, and "Does teacher-refined data always improve student model performance?" showed that data needs to match the student. LESS provides a principled mechanism for finding that match.

Related concepts in this collection 7

Can models improve themselves on tasks without verifiable answers? Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
complementary: LESS finds the right 5%, catalyst data shows 1000 examples suffice
Does teacher-refined data always improve student model performance? Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.
LESS provides the mechanism for student-aware selection
Does training data format shape reasoning strategy more than domain? What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
explains why wrong data hurts: format mismatch shifts reasoning strategy
Does self-generated training data improve model learning? Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.
related: data-learner compatibility as the key variable
What makes test-time training actually work in practice? Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
LESS provides a principled mechanism for TTT's first required component (task-similar finetuning): gradient-based influence estimation can identify the most relevant subset for that stage, making it more efficient and less fragile than heuristic data selection
Can careful selection of 78 demos outperform massive training datasets? Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
LIMI's 78-trajectory result is the agentic analog of LESS's finding: strategic curation outperforms volume; LESS provides the mechanism (gradient-based selection) that could identify which agentic trajectories matter most
Can careful curation replace massive alignment datasets? Does fine-tuning a strong pretrained model on 1000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? This challenges assumptions about data volume in post-training.
LIMA demonstrates the target state (1000 curated examples suffice for alignment); LESS provides the mechanism for reaching that state (gradient-based selection operationalizes what "careful curation" means computationally)
