How does probability mass concentration affect sampling diversity across model scales?

This explores why models that pile their probability onto a few favored outputs generate less varied samples, and whether bigger models do this more. The most direct answer in the corpus is counterintuitive: bigger is not better for diversity. For synthetic data generation, models around 500M parameters produce *more* unique outputs per sample than larger ones, because larger models concentrate probability mass on their preferred completions; within a fixed sampling budget, that sharpness costs you variety (see "Why aren't bigger models better for generating diverse outputs?"). So concentration and scale interact in a way that punishes the assumption that a more capable model is also a more inventive sampler.
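The link between concentration and a fixed sampling budget can be made concrete with a toy calculation. The sketch below is illustrative only: it assumes completions are drawn independently from a fixed categorical distribution, and it uses softmax temperature as a stand-in for "how concentrated the model is" (an assumption for illustration, not a claim about how scale works). The logits, the budget, and the function names are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature concentrates mass."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_unique(probs, n_samples):
    """Expected number of distinct outcomes in n_samples i.i.d. draws.

    Outcome i is seen at least once with probability 1 - (1 - p_i)^n,
    so the expectation is the sum of those terms.
    """
    return sum(1.0 - (1.0 - p) ** n_samples for p in probs)


# Hypothetical toy setup: 100 candidate completions with decaying logits.
logits = [-0.1 * i for i in range(100)]
flat = softmax(logits, temperature=1.0)    # mass spread more widely
sharp = softmax(logits, temperature=0.25)  # mass piled on a few favorites

budget = 50  # fixed sampling budget (assumed)
flat_unique = expected_unique(flat, budget)
sharp_unique = expected_unique(sharp, budget)

print(f"expected unique (flat):  {flat_unique:.1f}")
print(f"expected unique (sharp): {sharp_unique:.1f}")
```

Under these assumptions the sharper distribution yields fewer expected distinct completions for the same budget, which is the mechanism the paragraph above describes. Whether a given large model is actually sharper than a given small one is an empirical question this sketch does not answer.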

But scale isn't the only axis that controls where the mass lands; training does too, and often more decisively. Reinforcement learning that rewards only final-answer correctness sharpens the policy globally, concentrating mass on winning trajectories and draining diversity even on problems the model hasn't solved yet (see "Does outcome-based RL diversity loss spread across unsolved problems?"). The same entropy-collapse mechanism shows up in search agents, where RL squeezes exploration while supervised fine-tuning on diverse demonstrations preserves it (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Interestingly, scale resurfaces here as a hidden variable: when RL collapses a model's many pretraining formats down to one dominant format, which format wins depends on model scale and not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). Concentration is happening, but where the peak forms is scale-dependent and largely invisible when you start from a proprietary base.

The corpus also pushes back on the simple "concentration = bad" story. If you measure diversity only among outputs that pass a quality bar, preference-tuned models turn out *more* semantically diverse than base models; base models just looked diverse because their spread covered incoherent, low-quality space (see "Does preference tuning actually reduce the diversity of model outputs?"). Whether tuning helps or hurts also depends on domain: it reduces lexical-syntactic variety in code (where convergence on correctness is the point) but increases it in creative writing (see "Does preference tuning always reduce diversity the same way?"). So "concentration" can mean pruning garbage or it can mean homogenization: the same mechanism, opposite value.
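A minimal sketch of why the measurement choice matters, assuming a deliberately crude setup: exact string identity stands in for semantic distinctness, and a word-count threshold stands in for a real quality judge. The sample lists, `passes_quality`, and both function names are hypothetical and are not the metric used in the cited work.

```python
def distinct_count(outputs):
    """Raw diversity: number of distinct outputs, regardless of quality."""
    return len(set(outputs))


def quality_filtered_distinct(outputs, passes_quality):
    """Diversity counted only among outputs that clear the quality bar."""
    return len({o for o in outputs if passes_quality(o)})


def passes_quality(text):
    # Placeholder quality bar (assumption): at least three words.
    # A real evaluation would use a trained judge or a task-specific check.
    return len(text.split()) >= 3


# Hypothetical samples: the "base" set is varied but mostly incoherent,
# while the "tuned" set repeats more yet stays coherent throughout.
base_samples = [
    "a poem about rain", "asdf qwer", "zzzz", "lorem lorem",
    "a poem about rain", "xx yy", "a story about a fox", "qq",
]
tuned_samples = [
    "a poem about rain", "a story about a fox", "an essay on tides",
    "a poem about rain", "a letter to the sea", "a story about a fox",
    "a fable of two crows", "an essay on tides",
]

for name, samples in [("base", base_samples), ("tuned", tuned_samples)]:
    raw = distinct_count(samples)
    filtered = quality_filtered_distinct(samples, passes_quality)
    print(f"{name}: raw distinct = {raw}, quality-passing distinct = {filtered}")
```

In this toy data the base set wins on raw distinct count but loses once the filter is applied, mirroring the reversal described above. The numbers are constructed to show the effect and carry no empirical weight.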

What makes this more than a per-model curiosity is that concentration converges *across* models. Analysis of 70+ models on 26K open-ended queries found an "Artificial Hivemind": different models independently land on near-identical responses, because overlapping training data and shared alignment procedures sculpt their probability mass into the same shape, quietly undermining the diversity you'd hope to get from ensembling across scales and vendors (see "Do different AI models actually produce diverse outputs?"). And the stakes compound over time: in self-improvement loops, diversity is what enables out-of-distribution generalization, and once it's lost the degradation is irreversible (see "How do quality, diversity, and complexity affect synthetic data differently?"). The thing you didn't know you wanted to know: the surest route to genuine sample diversity may be a *smaller*, lightly-tuned model, not a larger, heavily-aligned one.
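One rough way to check whether an ensemble is actually buying you variety is to compare responses from different models to the same prompt. The sketch below uses mean pairwise Jaccard overlap on lowercase word sets, which is a simple lexical proxy chosen for illustration; it is not the measure used in the Hivemind analysis, and the example responses are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-set overlap between two responses (1.0 = same vocabulary)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(responses):
    """Average Jaccard overlap over all unordered pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0  # fewer than two responses: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical responses from three different models to one open-ended prompt.
homogeneous = [
    "time is a river that carries us forward",
    "time is a river that carries us all forward",
    "time is a river carrying us forward",
]
varied = [
    "time is a river that carries us forward",
    "clocks are polite liars",
    "every morning the calendar forgets yesterday",
]

print(f"homogeneous ensemble: {mean_pairwise_similarity(homogeneous):.2f}")
print(f"varied ensemble:      {mean_pairwise_similarity(varied):.2f}")
```

A high score across nominally independent models would be a warning sign that ensembling is adding cost without adding variety. Lexical overlap misses paraphrases, so an embedding-based similarity would be a more faithful choice in practice.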

Sources (8 notes)

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning actually reduce the diversity of model outputs?

When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

How do quality, diversity, and complexity affect synthetic data differently?

Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about probability mass concentration and sampling diversity across LLM scales. The question remains open: does concentration mechanically reduce diversity, and does scale predictably control it?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A curated library identified these constraints:
- Models around 500M parameters produce more unique outputs than larger ones; bigger models concentrate probability mass on preferred completions, trading off diversity within fixed sampling budgets (~2025).
- Outcome-based RL sharpens policy globally, concentrating mass on winning trajectories and draining diversity even on unsolved problems (~2025).
- RL training for search agents squeezes exploration diversity; supervised fine-tuning on diverse demonstrations preserves it (~2025).
- When RL collapses pretraining formats, the dominant format that emerges is scale-dependent, not performance-dependent (~2025).
- 70+ models independently converge on near-identical responses on 26K open-ended queries—an "Artificial Hivemind" driven by overlapping training data and shared alignment (~2025).
- Quality-filtered diversity in preference-tuned models exceeds base models; diversity effects are domain-dependent (reduces lexical variety in code, increases it in creative writing) (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2504.07912 Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2025-04)
- arXiv:2510.22954 Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) (2025-10)
- arXiv:2504.12522 Evaluating the Diversity and Quality of LLM Generated Content (2025-04)
- arXiv:2605.22817 Vector Policy Optimization: Training for Diversity Improves Test-Time Search (2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 500M-parameter optimum, the RL-induced entropy collapse, and the Hivemind convergence: has newer post-training (e.g., scaling laws for diversity, constitutional AI, process reward models, or open-weights alignment) relaxed these? Judge whether test-time compute (sampling temperature schedules, tree search, multi-agent orchestration, or adaptive decoding) now decouples scale from concentration. Separate the durable observation (models do concentrate mass) from the perishable claim (bigger always concentrates more).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Look for papers showing synthetic data generation, test-time inference orchestration, or fine-tuning regimes that *preserve* diversity at scale, or that show Hivemind convergence is avoidable.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Can diversity-aware training objectives scale beyond 500M without forfeiting concentration benefits? (b) Does multi-agent or ensemble-level decoding break Hivemind convergence?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
