Why aren't bigger models better for generating diverse outputs?

When generating many unique outputs within a fixed budget, does model size actually matter? This note explores whether the conventional wisdom of using larger models holds for diversity-focused tasks.

Synthesis note · 2026-05-18 · sourced from Evaluations

A non-obvious finding from "Evaluating the Diversity and Quality of LLM Generated Content": when the goal is to generate as many unique outputs as possible within a fixed sampling budget, which is the canonical use case for synthetic data generation, smaller models around 500M parameters are often the most efficient choice. Larger models do not produce proportionally more unique outputs per sample.

This is consistent with a broader pattern. Larger models concentrate probability mass more tightly on their preferred outputs. They are less likely to sample widely across the output space and more likely to repeat their top candidates. For applications that want diversity within a fixed inference budget, this concentration is a liability. A smaller model with a flatter output distribution can produce a broader spread of distinct outputs at the same compute cost.
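The effect of concentration on repeat sampling can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a hypothetical output space of 1,000 candidates and uses softmax temperature purely as a stand-in for how peaked a model's output distribution is. For n independent draws from a categorical distribution p, the expected number of distinct outcomes is the sum over i of 1 - (1 - p_i)^n.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; lower temperature means a more peaked distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_unique(probs, n_samples):
    """Expected number of distinct outcomes in n_samples i.i.d. draws from probs."""
    return sum(1.0 - (1.0 - p) ** n_samples for p in probs)

# Hypothetical toy output space: 1000 candidates with linearly decaying scores.
logits = [-0.01 * i for i in range(1000)]

flat = softmax(logits, temperature=1.0)    # stand-in for a flatter distribution
peaked = softmax(logits, temperature=0.1)  # stand-in for a concentrated distribution

n = 200  # same sampling budget for both
flat_unique = expected_unique(flat, n)
peaked_unique = expected_unique(peaked, n)

print(f"flat:   {flat_unique:.1f} expected distinct outputs in {n} samples")
print(f"peaked: {peaked_unique:.1f} expected distinct outputs in {n} samples")
```

With the same number of samples, the peaked distribution yields far fewer distinct outcomes because most draws land on the same few high-probability candidates. Whether a given pair of real models behaves this way has to be measured; the toy only shows the mechanism.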

The "use the biggest model you can afford" heuristic comes from tasks where each output is consumed individually, such as answering one question or writing one summary. For those tasks, output quality dominates and bigger is better. The synthetic-data and unique-content regime inverts the calculus. Each output contributes in proportion to its distinctiveness from the others; aggregate value is set by the variance of the sampled set, not by the peak quality of any single output.

For practitioners building synthetic training data, this argues for choosing model size based on the value function. If quality per output matters most, use the largest model. If uniqueness per dollar matters most, drop to a smaller model and run more samples. The crossover point varies by task; the paper places it around 500M parameters for programming tasks, but the principle is general.
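A minimal sketch of the "uniqueness per dollar" comparison follows. Everything in it is hypothetical: the function name `distinct_per_dollar`, the per-sample costs, and the sample outputs are illustrative placeholders, and exact string matching after whitespace stripping is a crude stand-in for whatever uniqueness measure a real pipeline would use.

```python
import math

def distinct_per_dollar(outputs, cost_per_sample):
    """Distinct outputs (exact match after stripping whitespace) per unit of spend."""
    total_cost = len(outputs) * cost_per_sample
    distinct = len({o.strip() for o in outputs})
    return distinct / total_cost

# Hypothetical samples drawn under the same total budget (0.008 units).
# The cheaper model affords twice as many samples.
small_outputs = ["a", "b", "c", "d", "a", "e", "f", "b"]  # 8 samples, 6 distinct
large_outputs = ["x", "x", "y", "x"]                      # 4 samples, 2 distinct

small_score = distinct_per_dollar(small_outputs, cost_per_sample=0.001)
large_score = distinct_per_dollar(large_outputs, cost_per_sample=0.002)

print(f"small model: {small_score:.0f} distinct outputs per unit cost")
print(f"large model: {large_score:.0f} distinct outputs per unit cost")
```

In practice the same comparison would be run on real samples from each candidate model at the task's actual prices, ideally after filtering out outputs that fail a quality check, so that cheap but unusable outputs do not inflate the score.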

The deeper observation is that "model capability" is not a single scalar that monotonically rewards scale. Different applications care about different summary statistics of the output distribution, and those summary statistics scale differently. Treating scale as universally beneficial is an artifact of evaluating models on single-output benchmarks.


Related concepts in this collection


Does preference tuning actually reduce the diversity of model outputs? The field assumes RLHF and DPO reduce diversity, but this assumption rests on measuring all outputs equally. What happens if we only count diverse outputs that meet quality thresholds?
same paper, the broader diversity-metric reframing
Does preference tuning always reduce diversity the same way? Explores whether the standard narrative that RLHF reduces model diversity holds equally across different task domains, or if the effect varies by what the domain rewards.
same paper, the domain-dependence
