SYNTHESIS NOTE

How do quality, diversity, and complexity affect synthetic data differently?

When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.

Synthesis note · 2026-05-03 · sourced from Data

Synthetic data generation methods proliferated rapidly but produced few directly comparable studies, because every method varied seeds, prompts, filters, and tasks simultaneously. The QDC framework proposes a cleaner basis for comparison: examine the quality, diversity, and complexity of resulting synthetic data, and trace how each characteristic maps to downstream model performance.

Three findings disentangle effects that previous work conflated. Quality is essential for in-distribution generalization: models learn to produce acceptable outputs only when training samples are faithful to the specification. Diversity is essential for out-of-distribution generalization: without sufficient variety in training, the model has no basis for handling distribution shifts. Complexity benefits both, because complex examples stretch the model's representational capacity rather than merely confirming existing capability.

A critical structural observation follows: there is a Quality-Diversity trade-off in training data. Maximizing quality by tightening rejection criteria narrows the distribution. Maximizing diversity broadens the distribution but admits more low-fidelity samples. The trade-off is irreducible at the level of any single sample — a sample cannot simultaneously be maximally diverse from the typical case and maximally compliant with the typical specification.
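The trade-off can be illustrated with a toy sketch. The setup here is an assumption made purely for illustration and does not come from the QDC framework itself: each sample is a single number, the "typical specification" is 0.0, quality is closeness to 0.0, and diversity is the spread (standard deviation) of the samples that survive filtering. The function name `filter_by_quality` and the thresholds are hypothetical.

```python
import random
import statistics

def filter_by_quality(samples, max_deviation):
    """Rejection filter: keep samples within max_deviation of the typical case (0.0)."""
    return [x for x in samples if abs(x) <= max_deviation]

# Toy "generator": samples drawn from a standard normal distribution (assumption).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Tighten the rejection criterion step by step and record the surviving spread.
spreads = []
for max_deviation in (3.0, 1.0, 0.3):
    kept = filter_by_quality(samples, max_deviation)
    spread = statistics.pstdev(kept)  # stand-in for diversity
    spreads.append(spread)
    print(f"max_deviation={max_deviation}: kept={len(kept)}, spread={spread:.3f}")
```

Under these assumptions, each tighter threshold raises the minimum quality of the kept set and lowers its spread. Real pipelines measure quality and diversity in far richer ways, so this only shows the direction of the effect.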

The most consequential implication is for self-improvement. Models are typically evaluated and optimized only for output quality. This quality-only training narrows output diversity. The narrowed outputs then become the synthetic data for the next training round, which has even less diversity, and so on. Self-improvement degrades because the data generator collapses toward the model's existing distribution: the model collapse mechanism in slow motion. Balancing QDC is therefore not a polish concern but a structural prerequisite for self-improvement to work, since a system that does not preserve diversity cannot bootstrap beyond its current capabilities.
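The feedback loop can be sketched as a small simulation. Everything in it is an illustrative assumption rather than a method from the source: the "model" is a one-dimensional Gaussian, quality is closeness to a fixed target, and "training" means refitting the Gaussian to the top-quality half of its own samples. The names `self_improvement_round` and `run` are hypothetical.

```python
import random
import statistics

def self_improvement_round(mean, std, rng, n_samples=2000, keep_fraction=0.5, target=0.0):
    """One round: generate, keep the highest-quality samples, refit the generator."""
    samples = [rng.gauss(mean, std) for _ in range(n_samples)]
    # Quality-only selection: rank by closeness to the target, ignore diversity.
    samples.sort(key=lambda x: abs(x - target))
    kept = samples[: int(n_samples * keep_fraction)]
    # "Retrain" the toy model on its own filtered outputs.
    return statistics.fmean(kept), statistics.pstdev(kept)

def run(rounds=5, seed=0):
    """Return the generator's spread before training and after each round."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(rounds):
        mean, std = self_improvement_round(mean, std, rng)
        history.append(std)
    return history

history = run()
for round_index, std in enumerate(history):
    print(f"round {round_index}: generator spread = {std:.4f}")
```

In this toy setting the generator's spread shrinks every round, because each round's training data is a narrowed subset of the previous round's outputs. A selection rule that also rewards distance between kept samples would be one way to counteract the shrinkage, though that variant is not shown here.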

Original note title

quality diversity and complexity create distinct downstream effects in synthetic training data — and most pipelines optimize only quality which constrains self-improvement