Can evolutionary search unlock problems that best-of-n selection cannot solve?

This explores whether evolutionary search — keeping a diverse population of candidates and recombining them — can solve problems that simply sampling many answers and picking the best (best-of-n) cannot.

The corpus says yes, and it points to *why*: the advantage is not extra compute but diversity that survives long enough to recombine. Mind Evolution uses LLMs to generate mutations and crossovers and solves 98% of planning tasks, beating both Best-of-N and Sequential Revision. Its key ingredient is an island model that keeps subpopulations apart so the search does not collapse prematurely onto one answer (see "Can evolutionary search beat sampling and revision at inference time?"). Best-of-n and single-trajectory refinement both throw away the population: one samples independently with no memory, the other polishes a single line of attack. Neither can stumble into the combinations that evolution explicitly builds.
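
To make the contrast concrete, here is a minimal sketch of best-of-n next to an island-model genetic search on a toy bit-matching task. Everything in it is an illustrative assumption: the bitstring task, the function names (`best_of_n`, `evolve_islands`), the ring migration scheme, and the parameter values. Mind Evolution itself uses LLM-generated mutations and crossovers on planning problems, which this toy replaces with random bit flips and one-point crossover.

```python
import random

def fitness(bits, target):
    # External evaluator (assumed): count of positions matching the target.
    return sum(b == t for b, t in zip(bits, target))

def random_candidate(rng, n):
    return [rng.randint(0, 1) for _ in range(n)]

def best_of_n(rng, target, n_samples):
    # Independent samples, no memory between them; keep the single winner.
    samples = [random_candidate(rng, len(target)) for _ in range(n_samples)]
    return max(samples, key=lambda c: fitness(c, target))

def crossover(rng, a, b):
    # One-point crossover: prefix from one parent, suffix from the other.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(rng, bits, rate):
    # Flip each bit independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve_islands(rng, target, n_islands=4, island_size=10,
                   generations=30, migrate_every=5, rate=0.02):
    def score(c):
        return fitness(c, target)

    n = len(target)
    islands = [[random_candidate(rng, n) for _ in range(island_size)]
               for _ in range(n_islands)]
    history = []  # best fitness across all islands after each generation
    for gen in range(1, generations + 1):
        for i, pop in enumerate(islands):
            # Keep the top half unchanged (elitism), refill with children.
            pop.sort(key=score, reverse=True)
            parents = pop[: island_size // 2]
            children = [
                mutate(rng, crossover(rng, *rng.sample(parents, 2)), rate)
                for _ in range(island_size - len(parents))
            ]
            islands[i] = parents + children
        if gen % migrate_every == 0:
            # Occasional ring migration: each island's worst member is
            # replaced by a copy of the previous island's best.
            bests = [max(pop, key=score) for pop in islands]
            for i, pop in enumerate(islands):
                pop.sort(key=score, reverse=True)
                pop[-1] = list(bests[i - 1])
        history.append(max(score(c) for pop in islands for c in pop))
    best = max((c for pop in islands for c in pop), key=score)
    return best, history

rng = random.Random(0)
target = random_candidate(rng, 40)
# 640 = 40 initial candidates + 20 children per generation * 30 generations,
# so both methods create the same number of candidates.
sampled = best_of_n(rng, target, 640)
evolved, history = evolve_islands(rng, target)
print(fitness(sampled, target), fitness(evolved, target))
```

Because the top half of each island is carried over unchanged, the best fitness in `history` never decreases. The infrequent migration is what keeps the islands mostly separate, so they can explore different partial solutions before mixing.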

The deeper reason this works shows up when you look at what kills search. Diversity collapse is the recurring villain: RL training compresses search agents onto narrow reward-maximizing strategies through the same entropy collapse seen in reasoning models, and the fix is to preserve exploration breadth rather than chase a single optimum (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Evolutionary methods are interesting precisely because diversity preservation is built into the algorithm. A striking parallel: diffusion models turn out to be mathematically equivalent to evolutionary algorithms, with denoising performing selection, mutation, and reproductive isolation, and they outperform mainstream evolutionary methods specifically by *preserving multimodality* where traditional methods collapse to one solution (see "Can diffusion models perform evolutionary search in parameter space?").
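
One simple way to put a number on diversity collapse is the Shannon entropy of the strategies present in a population. The sketch below is only an illustration: the strategy labels are hypothetical, and the notes do not say how the cited work measures diversity.

```python
import math
from collections import Counter

def strategy_entropy(strategies):
    # Shannon entropy in bits of the empirical distribution over strategies:
    # H = sum over strategies of p * log2(1 / p).
    counts = Counter(strategies)
    total = len(strategies)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical labels for the strategy each of 8 candidates follows.
diverse = ["a", "a", "b", "b", "c", "c", "d", "d"]
collapsed = ["a"] * 8

print(strategy_entropy(diverse))    # 2.0 (four equally common strategies)
print(strategy_entropy(collapsed))  # 0.0 (one strategy left)
```

An entropy that falls toward zero over training or over generations is the signature of a search that has stopped exploring.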

The really surprising payoff is that population search can reach capabilities no single starting point had. Swarms of LLM "particles" moving through weight space discover composed experts that answer questions *all* the initial experts failed on, using only 200 validation examples and no gradient-based training (see "Can language models discover new expertise through collaborative weight search?"). That is the qualitative line best-of-n cannot cross: best-of-n can only return the best thing already in its sample, while evolutionary recombination synthesizes something that was not in any individual candidate.
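
The "not in any individual candidate" point can be shown with a deliberately tiny, fully deterministic toy. It is not the weight-space swarm method from the note: candidates here are hypothetical 8-bit tuples, `solves` is an assumed pass/fail check, and recombination is plain one-point crossover.

```python
from itertools import permutations

TARGET = (1, 1, 1, 1, 0, 0, 0, 0)

def score(candidate):
    # Partial credit: number of positions that match the target.
    return sum(c == t for c, t in zip(candidate, TARGET))

def solves(candidate):
    # Pass/fail: the task counts as solved only if every position is right.
    return candidate == TARGET

def one_point_children(a, b):
    # Every child made of a prefix of `a` followed by a suffix of `b`.
    return [a[:cut] + b[cut:] for cut in range(1, len(a))]

# Two sampled candidates, each correct on a different half of the task.
sample = [(1, 1, 1, 1, 1, 1, 1, 1), (0, 0, 0, 0, 0, 0, 0, 0)]

# Best-of-n can only pick from what was sampled.
best_sampled = max(sample, key=score)

# Recombination builds candidates that were never sampled.
children = [child
            for a, b in permutations(sample, 2)
            for child in one_point_children(a, b)]
best_child = max(children, key=score)

print(solves(best_sampled), solves(best_child))  # False True
```

Both sampled candidates fail the task, yet one crossover of the two passes it, which is exactly the case a selection-only method cannot reach.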

The same logic scales up into self-improving systems. The Darwin Gödel Machine (DGM) keeps an evolutionary *archive* of agent variants and improves itself by empirical trial and error rather than proof, achieving a 2.5× improvement on SWE-bench by discovering new skills such as better code editing (see "Can AI systems improve themselves through trial and error?"). The archive is what lets it branch off old variants instead of greedily following one. A bilevel autoresearch loop went further: its outer loop rewrote the inner loop's search code at runtime and discovered entirely new mechanisms, a 5× improvement on GPT pretraining that came from *breaking* the inner loop's deterministic patterns (see "Can an AI system improve its own search methods automatically?").
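
The difference between keeping an archive and greedily following one variant can be sketched in a few lines. This is a generic illustration, not the DGM's procedure: the uniform parent choice, the function names, and the integer "variants" with a made-up benchmark are all assumptions, since the note does not say how parents are selected.

```python
import random

def archive_search(rng, seed_variant, evaluate, propose, steps):
    # Every evaluated variant stays in the archive, so any earlier variant
    # (not just the current best) can be picked as the next parent.
    archive = [(seed_variant, evaluate(seed_variant))]
    for _ in range(steps):
        parent, _score = rng.choice(archive)  # assumed: uniform choice
        child = propose(rng, parent)
        archive.append((child, evaluate(child)))
    best = max(archive, key=lambda entry: entry[1])
    return best, archive

def greedy_search(rng, seed_variant, evaluate, propose, steps):
    # Single lineage: a child replaces the current variant only if it
    # scores higher; everything else is discarded.
    current, current_score = seed_variant, evaluate(seed_variant)
    for _ in range(steps):
        child = propose(rng, current)
        child_score = evaluate(child)
        if child_score > current_score:
            current, current_score = child, child_score
    return current, current_score

# Toy stand-ins: a variant is an integer, the benchmark rewards being near 7.
def toy_benchmark(variant):
    return -abs(variant - 7)

def toy_propose(rng, variant):
    return variant + rng.choice([-1, 1])

rng = random.Random(0)
(best_variant, best_score), archive = archive_search(
    rng, 0, toy_benchmark, toy_propose, steps=50)
print(best_variant, best_score, len(archive))
```

On a smooth toy like this one the two strategies behave similarly. The archive matters when a variant that scores worse now is the stepping stone to a better one later, because the greedy loop has already thrown it away.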

Two cautions keep this honest. First, evolution's edge is not universal. On structured, genuinely sequential problems like graph connectivity, chain-of-thought's step-by-step accumulation beats parallel voting outright, because the answer has to be built in order (see "When does sequential reasoning beat parallel voting?"). Evolutionary search wins where the solution space is wide and multimodal, not where it is a single narrow chain. Second, pure self-driven search hits a wall: self-improvement is structurally circular and stalls on the generation-verification gap and diversity collapse unless it smuggles in an external signal such as a judge, a benchmark, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). Notice that every successful evolutionary system above quietly satisfies this: Mind Evolution has a task evaluator, the swarm has validation examples, and the DGM has empirical benchmarks. The honest answer, then, is that evolutionary search unlocks problems best-of-n cannot, but only when diversity is protected *and* an external fitness signal exists to select on.

Sources (8 notes)

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can diffusion models perform evolutionary search in parameter space?

Denoising in diffusion models performs selection, mutation, and reproductive isolation—the core mechanisms of evolution. Diffusion Evolution empirically outperforms mainstream evolutionary algorithms by preserving multimodality where traditional methods collapse to single solutions.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can an AI system improve its own search methods automatically?

An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5×.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
