Why aren't bigger models better for generating diverse outputs?
When generating many unique outputs within a fixed budget, does model size actually matter? This note explores whether the conventional wisdom of using larger models holds for diversity-focused tasks.
A non-obvious finding from "Evaluating the Diversity and Quality of LLM Generated Content": when the goal is to generate as many unique outputs as possible within a fixed sampling budget (the canonical use case for synthetic data generation), smaller models around 500M parameters are often the most efficient choice. Larger models do not produce proportionally more unique outputs per sample.
This is consistent with a broader pattern. Larger models concentrate probability mass more tightly on their preferred outputs. They are less likely to sample widely across the output space and more likely to repeat their top candidates. For applications that want diversity within a fixed inference budget, this concentration is a liability. A smaller model with a flatter output distribution can produce a broader spread of distinct outputs at the same compute cost.
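The effect of concentration on uniqueness can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes each sample is an independent draw from a fixed categorical distribution over possible outputs, and it uses Zipf-like distributions (a hypothetical stand-in for model output distributions) with a larger exponent playing the role of a more concentrated model. Under those assumptions, the expected number of distinct outputs after n draws is the sum over outputs of 1 - (1 - p_i)^n.

```python
def zipf_distribution(num_outputs: int, exponent: float) -> list[float]:
    """Zipf-like probabilities p_i proportional to rank^(-exponent).

    A larger exponent puts more mass on the top-ranked outputs.
    This is an illustrative assumption, not a measured model distribution.
    """
    weights = [rank ** -exponent for rank in range(1, num_outputs + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_unique(probs: list[float], num_samples: int) -> float:
    """Expected number of distinct outputs in num_samples independent draws.

    Output i appears at least once with probability 1 - (1 - p_i)^n,
    and expectations add across outputs.
    """
    return sum(1.0 - (1.0 - p) ** num_samples for p in probs)


# Hypothetical settings chosen only for illustration.
NUM_OUTPUTS = 10_000  # size of the possible output space
BUDGET = 1_000        # fixed number of samples

flat = zipf_distribution(NUM_OUTPUTS, exponent=0.5)    # flatter distribution
peaked = zipf_distribution(NUM_OUTPUTS, exponent=1.5)  # concentrated distribution

flat_unique = expected_unique(flat, BUDGET)
peaked_unique = expected_unique(peaked, BUDGET)

print(f"flatter distribution:      {flat_unique:.0f} expected unique outputs")
print(f"concentrated distribution: {peaked_unique:.0f} expected unique outputs")
```

With the same sampling budget, the flatter distribution yields more distinct outputs in expectation. The toy model ignores output quality entirely, which is the other half of the trade-off.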
The "use the biggest model you can afford" heuristic comes from tasks where each output is consumed individually: answering one question, writing one summary. For those tasks, output quality dominates and bigger is better. The synthetic-data and unique-content regime inverts the calculus. Each output contributes in proportion to its distinctiveness from the others; aggregate value is set by the variance of the sampled set, not by the peak quality of any single output.
For practitioners building synthetic training data, this argues for choosing model size based on the value function. If quality per output matters most, use the largest model you can afford. If uniqueness per dollar matters most, drop to a smaller model and run more samples. The crossover point varies by task; the paper places it around 500M parameters for programming tasks, but the principle is general.
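The selection rule above can be sketched as a small comparison under different value functions. Everything in this example is hypothetical: the model names, per-sample costs, uniqueness rates, and quality scores are placeholder numbers, not measurements from the paper, and treating the uniqueness rate as constant across the whole budget is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_sample: float  # dollars per generated sample (hypothetical)
    unique_rate: float      # fraction of samples that are distinct (hypothetical)
    quality: float          # mean per-output quality in [0, 1] (hypothetical)


def unique_per_dollar(option: ModelOption) -> float:
    """Distinct outputs obtained per dollar, assuming a constant unique rate."""
    return option.unique_rate / option.cost_per_sample


def quality_per_output(option: ModelOption) -> float:
    """Value function for tasks where each output is consumed on its own."""
    return option.quality


def pick_model(options: list[ModelOption],
               value_fn: Callable[[ModelOption], float]) -> ModelOption:
    """Return the option that maximizes the chosen value function."""
    return max(options, key=value_fn)


# Placeholder numbers for illustration only.
OPTIONS = [
    ModelOption("small-hypothetical", cost_per_sample=0.0001,
                unique_rate=0.6, quality=0.5),
    ModelOption("large-hypothetical", cost_per_sample=0.002,
                unique_rate=0.5, quality=0.8),
]

print("best for uniqueness per dollar:", pick_model(OPTIONS, unique_per_dollar).name)
print("best for quality per output:   ", pick_model(OPTIONS, quality_per_output).name)
```

The point of the sketch is that the same candidate set produces different winners depending on which value function is passed in. In practice the uniqueness rate and quality would need to be measured per task, and a quality threshold on what counts as a usable unique output could shift the result.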
The deeper observation is that "model capability" is not a single scalar that monotonically rewards scale. Different applications care about different summary statistics of the output distribution, and those summary statistics scale differently. Treating scale as universally beneficial is an artifact of evaluating models on single-output benchmarks.
Inquiring lines that use this note as a source (34)
This note is a source for these synthesized inquiries.
- When does the right constraint beat additional model capacity?
- What production constraints should determine paradigm selection?
- Can few-shot examples narrow generative diversity in creative tasks?
- How do larger models maintain more parallel tasks than smaller models?
- Do larger models develop more abstract features than smaller ones?
- Do different domains require different types of model investment?
- Does the optimal model size depend on what capabilities you actually need?
- Can smaller models actually perform well on specific downstream tasks?
- How can smaller models help select useful data for larger models?
- What makes output convergence across models inevitable given input-side homogenization?
- Why does depth outperform width for sub-billion parameter models?
- How does the Ladder of Scales approach reduce search costs across model sizes?
- Why do smaller and larger models converge on different output formats?
- Why do production systems optimize for three model classes instead of foundation models?
- How do quality, diversity, and complexity create different effects on downstream model performance?
- Do small models show different parameter efficiency patterns than large models?
- Can multiple small models outperform a single large model with good routing?
- Which architectural choices matter most when a model must fit one billion parameters?
- Why might diverse smaller models with routing beat one giant model?
- What consumption data would validate the limited-consumption model in production systems?
- How does smooth generation lead to proliferation without new viewpoints?
- How does graph-based tool sampling differ from random sampling in diversity?
- What makes a small surgical wide component sufficient with a capable deep model?
- How do quality thresholds change which model produces more usable diversity?
- How should we evaluate diversity differently across programming and creative tasks?
- What makes creative writing diversity different from code diversity fundamentally?
- Does fine-tuning a small model match fine-tuning a large one?
- How does probability mass concentration affect sampling diversity across model scales?
- At what point does output quality outweigh diversity value in synthetic data tasks?
- What output distribution properties make smaller models better for wide sampling?
- How can expensive models efficiently support cheap models in production?
- What benefits do open foundation models create that closed systems cannot?
- How do complexity and diversity affect model performance differently?
- Can smaller models produce skill updates as useful as frontier model updates?
Related concepts in this collection (2)
- Does preference tuning actually reduce the diversity of model outputs?
  The field assumes RLHF and DPO reduce diversity, but this assumption rests on measuring all outputs equally. What happens if we only count diverse outputs that meet quality thresholds?
  same paper, the broader diversity-metric reframing
- Does preference tuning always reduce diversity the same way?
  Explores whether the standard narrative that RLHF reduces model diversity holds equally across different task domains, or if the effect varies by what the domain rewards.
  same paper, the domain-dependence
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
- Personalized Dialogue Generation with Persona-Adaptive Attention
- Reasoning-Driven Synthetic Data Generation and Evaluation
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- Foundation Priors
- Orchestrating Synthetic Data with Reasoning
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Original note title
smaller models around 500M parameters are most efficient for unique-output generation within a fixed sampling budget — parameter scale is not monotonic