SYNTHESIS NOTE

Can we prune training data without hurting model performance?

This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.

Synthesis note · 2026-02-22 · sourced from LLM Architecture

"Beyond Neural Scaling Laws" (2206.14486) challenges the assumption that scaling laws are fixed. Power-law scaling of error with dataset size implies massive redundancy — many training examples contribute marginally. If you can rank examples by difficulty or importance and prune the easy/redundant ones, you can beat the power law.

In theory, the paper shows that exponential scaling is possible given an ideal pruning metric. Empirically, it reports better-than-power-law scaling for ResNets trained on CIFAR-10, SVHN, and ImageNet.
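To make the contrast concrete, the two scaling regimes can be written side by side. The notation is an assumption of this sketch, not taken from the note: P is the number of training examples kept, ε is test error, and ν and c are positive fit constants.

```latex
% Assumed notation: P = training set size, \varepsilon = test error,
% \nu > 0 and c > 0 are fit constants (illustrative, not from the paper).
\varepsilon_{\text{power}}(P) \propto P^{-\nu}
\qquad \text{versus} \qquad
\varepsilon_{\text{exp}}(P) \propto e^{-cP}

% Cost of halving the error in each regime:
% power law:   P \to 2^{1/\nu} \, P     (multiplicative; about 1024x for \nu = 0.1)
% exponential: P \to P + \frac{\ln 2}{c} (additive; a fixed number of extra examples)
```

Under a power law with a small exponent, each halving of error multiplies the data requirement; under exponential scaling, each halving costs only a fixed number of additional examples.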

The pruning metrics (forgetting scores, memorization, and EL2N) each rank training examples by difficulty.

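The key insight: easy examples (low forgetting, low memorization, low EL2N) are largely redundant with the rest of the data, while hard examples carry information that cannot be recovered from other examples. When data is abundant, pruning the easy examples therefore preserves most of the information that matters. The paper adds a caveat: when the initial dataset is small, keeping the easy examples works better.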
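As an illustration of difficulty-based pruning, here is a minimal sketch using an EL2N-style score: the L2 norm of the difference between the predicted class probabilities and the one-hot label. The function names and toy data are hypothetical. In practice the score is usually averaged over several models early in training; this sketch scores a single set of predictions.

```python
import math

def el2n_score(probs, label):
    """L2 norm of (predicted probabilities - one-hot label) for one example."""
    return math.sqrt(sum((p - (1.0 if k == label else 0.0)) ** 2
                         for k, p in enumerate(probs)))

def prune_easiest(scores, keep_fraction):
    """Return the sorted indices of the hardest `keep_fraction` of examples."""
    n_keep = max(1, round(keep_fraction * len(scores)))
    # Rank from hardest (highest score) to easiest (lowest score).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy predictions for four examples whose true class is 0 (illustrative data).
probs = [
    [0.98, 0.01, 0.01],  # confident and correct: easy
    [0.40, 0.35, 0.25],  # uncertain
    [0.05, 0.90, 0.05],  # confident and wrong: hard
    [0.90, 0.05, 0.05],  # mostly correct: easy
]
labels = [0, 0, 0, 0]

scores = [el2n_score(p, y) for p, y in zip(probs, labels)]
kept = prune_easiest(scores, keep_fraction=0.5)
print(kept)
```

With half the data kept, the two confidently correct examples are dropped and the uncertain and misclassified ones remain.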

As in the related note "Can we train better models on less data?", data efficiency comes from identifying the valuable subset; the data pruning result carries that finding from instruction tuning over to pretraining. The mechanisms differ: LESS uses gradient-based influence, while data pruning uses difficulty metrics. Both converge on the same conclusion: most training data is redundant, and identifying the valuable fraction is the key optimization.
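For contrast with difficulty metrics, the sketch below shows gradient-similarity selection, loosely in the spirit of gradient-based influence. It is a simplification under stated assumptions, not the LESS algorithm: each training example is assumed to already have a low-dimensional gradient feature vector, and examples are ranked by cosine similarity to a single target-task gradient. All names and numbers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_by_gradient_similarity(train_grads, target_grad, k):
    """Return the indices of the k training examples most aligned with the target."""
    scores = [cosine(g, target_grad) for g in train_grads]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy gradient features (assumed precomputed; illustrative values only).
train_grads = [
    [1.0, 0.0],    # aligned with the target direction
    [0.0, 1.0],    # nearly orthogonal
    [-1.0, 0.0],   # opposed
    [0.7, 0.7],    # partially aligned
]
target_grad = [1.0, 0.1]

selected, scores = select_by_gradient_similarity(train_grads, target_grad, k=2)
print(selected)
```

The design difference is visible in the inputs: this selection needs a target task to compare against, whereas a difficulty metric scores each example against its own label.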

A practical challenge remains: most high-performing metrics are computationally expensive and require a label for every example. The paper develops a self-supervised pruning metric that scales to ImageNet with performance comparable to the best supervised metrics, which makes data pruning viable for large unlabeled corpora.
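The note does not describe how the self-supervised metric is built. One plausible construction, assumed here, clusters example embeddings with k-means and scores each example by its distance to the nearest centroid (its "prototype"): examples close to a prototype count as easy, distant ones as hard. The sketch below uses Euclidean distance, fixed initial centroids, and toy 2-D embeddings, all of which are illustrative assumptions. A real pipeline would use embeddings from a self-supervised model and an array library.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, init_centroids, n_iters=10):
    """Plain k-means (Lloyd's algorithm) with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def prototype_scores(points, centroids):
    """Difficulty score: distance from each point to its nearest centroid."""
    return [min(dist(p, c) for c in centroids) for p in points]

# Toy embeddings: two tight clusters plus one point far from both.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [1.5, 1.5],
]
centroids = kmeans(embeddings, init_centroids=[embeddings[0], embeddings[3]])
scores = prototype_scores(embeddings, centroids)
hardest = max(range(len(scores)), key=lambda i: scores[i])
print(hardest)
```

No labels are used anywhere in the scoring, which is what would let such a metric apply to unlabeled corpora.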

Original note title

data pruning based on difficulty metrics can achieve exponential rather than power-law scaling — not all training examples are equally valuable