Which aggregation method best exploits diversity in generated solutions?

This explores how to combine many candidate solutions into a better answer, and which combination strategy actually gets value out of their differences. The corpus doesn't crown one winner so much as reveal a split between two families: methods that *select* (pick the best candidate or route to the best generator) and methods that *recombine* (search across candidates and merge their modes). The strongest signal points to recombination via search — but only when the upstream pool is genuinely diverse, which turns out to be the harder problem.

On the selection side, routing is the standout. Sending each query to the model best suited for it beats simply building one bigger model: a cluster-routing ensemble (Avengers-Pro) scores 7% higher accuracy than a frontier model (GPT-5-medium) or matches it at 27% lower cost, and ten 7B models behind a router previously surpassed much larger ones (see "Can routing beat building one better model?"). Notably, the winning move is a *pre-generation* decision: estimate query difficulty and pick the model before any solution is generated (see "Can routers select the right model before generation happens?"). That's selection at its leanest: it never aggregates multiple solutions at all, it just chooses the right source. It exploits diversity *across models* rather than diversity *within a candidate set*.
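
To make the pre-generation routing decision concrete, here is a minimal sketch, not the method of any cited system. It assumes a toy 2-D embedding space, and the centroids, model names, accuracy figures, costs, and `cost_weight` are all hypothetical values invented for illustration. The point it shows is that routing chooses a single source before anything is generated, so there is nothing to aggregate afterwards.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space (illustrative only).
CENTROIDS = {
    "math": (1.0, 0.0),
    "writing": (0.0, 1.0),
}

# Hypothetical per-cluster accuracy and per-query cost (made-up numbers).
MODEL_STATS = {
    "small-a": {"cost": 1.0, "accuracy": {"math": 0.80, "writing": 0.55}},
    "small-b": {"cost": 1.0, "accuracy": {"math": 0.50, "writing": 0.85}},
    "large": {"cost": 10.0, "accuracy": {"math": 0.78, "writing": 0.83}},
}


def nearest_cluster(embedding):
    """Return the name of the centroid closest to the query embedding."""
    return min(CENTROIDS, key=lambda name: math.dist(embedding, CENTROIDS[name]))


def route(embedding, cost_weight=0.01):
    """Pick one model before generation: best accuracy-minus-cost in the query's cluster."""
    cluster = nearest_cluster(embedding)

    def score(model):
        stats = MODEL_STATS[model]
        return stats["accuracy"][cluster] - cost_weight * stats["cost"]

    return max(MODEL_STATS, key=score)


print(route((0.9, 0.1)))  # a math-like query
print(route((0.1, 0.9)))  # a writing-like query
```

Raising `cost_weight` trades accuracy for cost. With these made-up numbers the two small specialists already cover both clusters, which mirrors the claim that selection across differing models can stand in for one bigger model.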

The recombination side aims higher. Vector Policy Optimization trains a model to emit several distinct competent solutions instead of converging on one, specifically so that downstream search (evolutionary algorithms that explore and *combine* modes) can solve problems an entropy-collapsed policy can't reach at all (see "Should training maximize diversity when models feed into search?"). This is the most direct answer to the literal question: the aggregation method that best exploits diversity is search-based mode combination, because it treats the spread of solutions as raw material to recombine, not noise to vote away.
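
The difference between selecting and recombining can be seen in a toy sketch. It is an illustration only, not Vector Policy Optimization or any cited search procedure. It assumes candidates are equal-length strings and that a scorer is available; here the scorer is a hypothetical `fitness` that compares against a known `TARGET`, which real problems would replace with a verifier or test suite. The search is a deterministic elitist loop over one-point crossovers.

```python
from itertools import permutations


def fitness(candidate, target):
    """Count positions where the candidate matches the (hypothetical) target."""
    return sum(c == t for c, t in zip(candidate, target))


def recombine(pool, target, generations=3, keep=4):
    """Elitist search: try every one-point crossover of every ordered pair, keep the fittest."""
    population = list(pool)
    for _ in range(generations):
        children = [
            a[:cut] + b[cut:]
            for a, b in permutations(population, 2)
            for cut in range(1, len(a))
        ]
        # Deduplicate (order-preserving), then keep the fittest parents and children.
        merged = list(dict.fromkeys(population + children))
        population = sorted(merged, key=lambda c: fitness(c, target), reverse=True)[:keep]
    return population[0]


TARGET = "ABCDEF"
# Three distinct modes: each is right about a different third of the problem.
DIVERSE_POOL = ["ABxxxx", "xxCDxx", "xxxxEF"]
# A collapsed pool: three copies of the same mode.
COLLAPSED_POOL = ["ABxxxx", "ABxxxx", "ABxxxx"]

best_single = max(DIVERSE_POOL, key=lambda c: fitness(c, TARGET))
print(fitness(best_single, TARGET))                                # best by selection alone
print(fitness(recombine(DIVERSE_POOL, TARGET), TARGET))            # after recombination
print(fitness(recombine(COLLAPSED_POOL, TARGET), TARGET))          # collapsed pool
```

Picking the best member of the diverse pool can do no better than any single mode, while crossover merges the three partial solutions. The collapsed pool gives the search nothing to combine, so recombination returns what it started with.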

But here's the catch the corpus keeps surfacing: most aggregation quietly fails because the diversity was never real. Ensembling many models assumes they disagree, yet 70+ models on open-ended queries collapse into an "Artificial Hivemind," producing near-identical outputs because of overlapping training data and alignment procedures (see "Do different AI models actually produce diverse outputs?"). And the standard training recipe actively destroys the diversity aggregation depends on: outcome-based RL sharpens the policy globally, draining variety even on unsolved problems (see "Does outcome-based RL diversity loss spread across unsolved problems?"). Step-level critique during training counteracts this tail-narrowing and keeps solutions varied across self-training rounds (see "Do critique models improve diversity during training itself?"). So "which aggregation method" is half the question; the other half is whether anything diverse made it into the pool.
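
One way to check whether anything diverse made it into the pool is to measure it before aggregating. The sketch below uses mean pairwise Jaccard distance over word sets. This is a crude lexical proxy chosen for illustration, not the metric used in the cited studies; an embedding-based similarity would be a more faithful measure of semantic convergence. The function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over lowercase word sets; 0.0 means identical vocabularies."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def pool_diversity(outputs):
    """Mean pairwise Jaccard distance; a value near 0 suggests the pool has converged."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


converged = ["the cat sat", "The cat sat", "the cat sat"]
varied = ["a b c", "a b d"]
print(pool_diversity(converged))
print(pool_diversity(varied))
```

A score near zero on outputs from nominally different models is the signal that an ensemble is unlikely to add anything over a single sample.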

The quietest finding may be the most useful: diversity without competence doesn't aggregate into quality, it aggregates into noise. Multi-agent teams beat a solo agent only when members hold genuine domain expertise; diverse-but-shallow teams underperform a single competent one, because stimulation without grounding produces process losses instead of insight (see "Does cognitive diversity alone improve multi-agent ideation quality?"). The lesson across all of these: the best aggregation method is whichever one matches *where* your diversity actually lives. Route when your models differ, search-and-recombine when your candidates are both varied and competent, and don't bother aggregating a pool that quietly converged.
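
The closing rule of thumb can be written down as a small checklist function. This is a sketch under loud assumptions: the three inputs are hypothetical scores in [0, 1] that you would have to estimate yourself, and the thresholds are arbitrary placeholders, not values from any cited work. It returns every strategy whose precondition holds rather than ranking them, since the notes do not say which wins when both apply.

```python
def applicable_strategies(model_spread, pool_diversity, pool_competence,
                          spread_floor=0.2, diversity_floor=0.3, competence_floor=0.6):
    """List the aggregation strategies whose preconditions hold (thresholds are placeholders)."""
    strategies = []
    # Route when the available models genuinely differ in what they are good at.
    if model_spread >= spread_floor:
        strategies.append("route")
    # Recombine only when candidates are both varied and competent.
    if pool_diversity >= diversity_floor and pool_competence >= competence_floor:
        strategies.append("search-and-recombine")
    # A converged or shallow pool is not worth aggregating.
    return strategies or ["skip-aggregation"]


print(applicable_strategies(0.5, 0.1, 0.9))  # models differ, pool converged
print(applicable_strategies(0.0, 0.6, 0.8))  # one model, varied and competent pool
print(applicable_strategies(0.0, 0.6, 0.2))  # varied but shallow pool
```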

Sources 7 notes

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about aggregation methods for diverse LLM solutions. The question remains open: which combination strategy actually exploits solution diversity?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable snapshots:
• Routing (pre-generation model selection) outperforms larger single models by ~7% or matches them at lower cost; it exploits cross-model diversity without aggregating multiple solutions (2024–2025).
• Search-based recombination (evolutionary algorithms combining solution modes) is the direct answer to diversity exploitation, but only when upstream diversity is genuine and competent (2026).
• Most aggregation fails silently: 70+ LLMs on open-ended tasks converge into near-identical outputs ("Artificial Hivemind"), collapsing assumed diversity (2025).
• Outcome-based RL globally sharpens policy, draining solution variety even on unsolved problems; step-level critique training counteracts this tail-narrowing (2024–2025).
• Diverse-but-shallow multi-agent teams underperform a single competent agent; expertise is non-negotiable for aggregation gains (2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.14618 (2024): Hybrid LLM routing
• arXiv:2605.22817 (2026): Vector Policy Optimization
• arXiv:2510.22954 (2025): Artificial Hivemind convergence
• arXiv:2508.04575 (2025): Multi-agent quality drivers

Your task:
(1) RE-TEST convergence and competence constraints. Have newer training regimes (post-2025 alignment, open-weights scaling, synthetic-data diversity injection) since *created* genuine diversity where it was collapsing, or does the Hivemind effect still hold? Does outcome-based RL remain the dominant diversity killer, or have newer objective designs (e.g., process-level supervision, intrinsic motivation) preserved variance? Separate the durable insight (aggregate only what's both diverse AND competent) from what's been operationally solved.
(2) Surface work from the last ~6 months that directly contradicts the routing-vs-recombination split or shows a third family outperforming both.
(3) Propose 2 questions assuming the regime has shifted: (a) Can synthetic diversity injection at training time overcome Hivemind-like convergence, and if so, what density of synthetic personas suffices? (b) Does routing remain optimal if recombination methods learn to *generate* rather than just *search* across candidates?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
