How do quality, diversity, and complexity create different effects on downstream model performance?

This explores how three properties of training data — quality, diversity, and complexity — pull downstream model performance in *different directions*, rather than all feeding one generic 'better data' dial.

Quality, diversity, and complexity act on downstream performance as separate levers, and collapsing them into a single 'good data' score quietly breaks models over time. The clearest map of this comes from work showing that each property does a distinct job: quality drives *in-distribution* generalization (doing well on the kind of data you trained on), diversity drives *out-of-distribution* generalization (holding up on the unfamiliar), and complexity strengthens both at once (see: How do quality, diversity, and complexity affect synthetic data differently?). The trap is that most evaluation pipelines measure only quality and treat diversity as noise, so self-improvement loops keep optimizing the one number they can see while irreversibly losing the diversity they never tracked.
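
As a rough illustration of why a single score hides which lever is failing, here is a minimal Python sketch. The `DataReport` class, the `regressions` helper, the unweighted mean used for the collapsed score, and all numbers are hypothetical and chosen only for illustration; none of them come from the cited notes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DataReport:
    """Hypothetical per-iteration report; each field is a score in [0, 1]."""
    quality: float
    diversity: float
    complexity: float

    def collapsed(self) -> float:
        # Assumption: the single "good data" score is an unweighted mean.
        return (self.quality + self.diversity + self.complexity) / 3


def regressions(prev: DataReport, curr: DataReport, tol: float = 0.0) -> list:
    """Return the names of properties that dropped by more than `tol`."""
    return [
        f.name
        for f in fields(DataReport)
        if getattr(curr, f.name) < getattr(prev, f.name) - tol
    ]


# Made-up numbers: quality rises, diversity falls, complexity is unchanged.
before = DataReport(quality=0.70, diversity=0.60, complexity=0.50)
after = DataReport(quality=0.90, diversity=0.45, complexity=0.50)

print(after.collapsed() > before.collapsed())  # the collapsed score went up
print(regressions(before, after))              # the per-property view flags diversity
```

In this toy case the collapsed score improves while diversity regresses, which is the failure the paragraph above describes: only the per-property view shows what broke.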

That blind spot shows up everywhere once you look. Pure self-improvement stalls because of diversity collapse, the generation-verification gap, and reward hacking, and the methods that actually work smuggle in some external anchor (a past checkpoint, a judge, a user correction, or tool feedback) to refill what the loop drains (see: Can models reliably improve themselves without external feedback?). RL post-training makes the mechanism vivid: within the first epoch it amplifies one dominant pretraining format and suppresses the rest, and which format 'wins' depends on model scale, not necessarily on performance (see: Does RL training collapse format diversity in pretrained models?). So the quality-optimizing pressure isn't neutral toward diversity; it is actively corrosive unless something counteracts it.

But 'diversity' itself splits into two things that are easy to confuse, and this is the part most readers don't expect. Raw output variance isn't the same as *useful* variance. When you measure diversity only among outputs that pass a quality bar, preference-tuned models turn out to be *more* semantically diverse than base models; base models only looked diverse because their variance sprawled across incoherent space (see: Does preference tuning actually reduce the diversity of model outputs?). The effect of preference tuning even reverses by domain: RLHF compresses lexical-syntactic diversity in code, where convergence on the correct answer is the goal, but expands it in creative writing, where distinctiveness is rewarded (see: Does preference tuning always reduce diversity the same way?). Diversity, in other words, is only good relative to what the task wants.
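
To make the 'diversity among quality-passing outputs' idea concrete, here is a minimal Python sketch. It is not the metric used in the cited work: the word-set Jaccard distance is only a crude lexical stand-in for semantic diversity, and the function names, quality scores, and threshold are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over lowercase word sets (a crude lexical proxy)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def mean_pairwise_distance(texts: list) -> float:
    """Average distance over all unordered pairs; 0.0 if fewer than two texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def quality_filtered_diversity(outputs: list, scores: list, threshold: float) -> float:
    """Diversity measured only among outputs whose quality score passes the bar."""
    kept = [o for o, s in zip(outputs, scores) if s >= threshold]
    return mean_pairwise_distance(kept)


# Toy data: the third output is incoherent and scores low on quality.
outputs = ["a b c", "a b d", "x y z"]
scores = [0.9, 0.8, 0.1]

raw = mean_pairwise_distance(outputs)
useful = quality_filtered_diversity(outputs, scores, threshold=0.5)
print(raw > useful)  # the incoherent sample inflates raw diversity
```

In this toy example the low-quality output accounts for most of the raw variance, so filtering first gives a lower but more meaningful number. A real measurement would swap in an embedding-based distance and a real quality judge.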

The most encouraging thread is that quality and diversity aren't doomed to trade off: you can make them reinforce each other if you optimize for both explicitly. DARLING rewards semantic diversity *during* RL and finds it catalyzes exploration, producing higher-quality outputs than quality-only baselines on both creative and math tasks (see: Can diversity optimization improve quality during language model training?). Step-level critique models do something similar inside the training loop, counteracting the 'tail narrowing' that kills solution variety across self-training iterations (see: Do critique models improve diversity during training itself?). Counterintuitively, generators of around 500M parameters produce more unique samples per budget than larger models, which concentrate probability mass on their favorites (see: Why aren't bigger models better for generating diverse outputs?). So for synthetic data, the diversity lever and the scale lever can point opposite ways.
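
The link between concentrated probability mass and fewer unique samples per budget can be illustrated with a standard probability identity: for independent draws from a categorical distribution, the expected number of distinct outcomes after n draws is the sum over outcomes of 1 - (1 - p_i)^n. The Python sketch below applies it to two toy distributions. The distributions are hypothetical and do not model any real language model or reproduce the cited result.

```python
def expected_unique(probs: list, budget: int) -> float:
    """Expected number of distinct outcomes seen in `budget` independent draws."""
    return sum(1.0 - (1.0 - p) ** budget for p in probs)


# Toy distributions over four possible outputs (hypothetical numbers).
flat = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly
peaked = [0.85, 0.05, 0.05, 0.05]  # mass concentrated on one favorite

budget = 4
print(expected_unique(flat, budget))    # 2.734375
print(expected_unique(peaked, budget))  # roughly 1.56
```

With the same sampling budget, the peaked distribution yields fewer distinct outputs in expectation, which is the intuition behind a generator with a sharper output distribution being a weaker source of variety.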

Complexity, the third property, has the sharpest cautionary tale. More demanding training data helps, but only up to a point: overly hard RLVR samples push models to learn degenerate shortcuts (answer repetition, skipped computation) that then *contaminate* capabilities they already had, because group-relative normalization treats rare accidental successes as high-advantage trajectories (see: Do overly hard RLVR samples actually harm model capabilities?). The benign-looking version of this is instruction density, which degrades performance in predictable patterns that depend on model type: linear for small models, exponential for mid-range models, and a threshold cliff for reasoning models (see: How does instruction density affect model performance?). The throughline across all three levers: each helps a *different* thing, each fails in a *different* way, and the moment you fold them into one metric you lose the ability to see which one is breaking.
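
The source note on overly hard RLVR samples attributes the problem to group-relative normalization. A minimal Python sketch of that effect follows, assuming the common form in which each rollout's reward is standardized by the mean and standard deviation of its group. The function name, the `eps` term, and the binary reward groups are assumptions for illustration, not details taken from the cited work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Standardize each reward against its own group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Nearly impossible problem: one lucky success among eight rollouts.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]
# Moderately hard problem: half the rollouts succeed.
balanced_group = [1, 1, 1, 1, 0, 0, 0, 0]

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

print(round(hard_adv[-1], 2))     # lone success: about 2.65
print(round(balanced_adv[0], 2))  # typical success: about 1.0
```

Under this assumed normalization, the single success on the near-impossible problem receives a much larger advantage than a success on the balanced one, regardless of whether it came from sound reasoning or a shortcut.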

Sources (10 notes)

How do quality, diversity, and complexity affect synthetic data differently?

Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning actually reduce the diversity of model outputs?

When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

How does instruction density affect model performance?

The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range models), and threshold decay (reasoning models hold up until roughly 150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.
