SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

"Mind the Gap" (Song et al., 2025) formalizes the precondition for self-improvement: the generation-verification gap, defined as the difference between a model's ability to verify solutions versus its ability to generate them. When this gap is positive, self-improvement has room to operate — the model can evaluate outputs better than it can produce them, creating a usable training signal.

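The gap can be made concrete with a small sketch. The version below is a simplified proxy, not the formal definition from the paper: it assumes the model's verification ability is used as a keep/reject filter over its own samples, and measures how much accuracy rises after filtering. The function name, the 0..1 score scale, the threshold, and the toy data are all assumptions made for illustration.

```python
def generation_verification_gap(correct, verifier_scores, threshold=0.5):
    """Filter-based proxy for the gap (an assumed simplification).

    correct:          0/1 ground-truth labels, one per sampled response
    verifier_scores:  the model's own score for each response (assumed 0..1 scale)
    threshold:        responses scoring at or above this are kept (assumed rule)
    """
    if len(correct) != len(verifier_scores):
        raise ValueError("need one verifier score per response")
    if not correct:
        raise ValueError("need at least one response")

    # Accuracy of raw generation: the fraction of samples that are correct.
    generation_acc = sum(correct) / len(correct)

    # Accuracy after the model's own verifier filters the samples.
    kept = [c for c, s in zip(correct, verifier_scores) if s >= threshold]
    # If the verifier rejects everything, filtering adds no signal.
    filtered_acc = sum(kept) / len(kept) if kept else generation_acc

    return filtered_acc - generation_acc


# Toy data, invented for illustration: 2 of 6 samples are correct.
correct = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.8, 0.1, 0.3]
gap = generation_verification_gap(correct, scores)
print(f"gap = {gap:.3f}")
```

A positive value means filtering by the model's own scores yields a better-than-average subset, which is the usable training signal described above. A value near zero means verification adds nothing beyond generation.
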
The gap scales monotonically with pretraining FLOPs: larger models have larger generation-verification gaps, which helps explain why self-improvement methods work better on larger models. For 4×4 Sudoku (solving generalized Sudoku is NP-complete, while checking a filled grid takes polynomial time), only the largest models (72B+) show non-trivial gaps, with 50-300% accuracy improvement.

However, the gap vanishes for factual recall tasks. On Natural Questions, the gap is <1% or negative across all model sizes — verification provides no additional signal because knowing the answer and verifying the answer require the same factual knowledge. This predicts which tasks will benefit from self-improvement and which won't: tasks where generation is computationally harder than verification (math, code, structured problems) benefit; tasks where both require the same knowledge (factual QA) don't.

The diversity collapse finding is equally important: during iterative self-improvement, pass@k increases for small k (quality improves at the top) but decreases for large k (diversity decreases overall). The model converges on solutions it can verify, which are typically common patterns. Rare but correct solutions get filtered out because the model can't verify them. This is the entropy collapse dynamic operating through the verification bottleneck rather than through the policy directly.
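
The pass@k pattern can be illustrated with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are invented to show the shape of the effect and are not results from any paper. The "before" model solves every problem occasionally, while the "after" model solves two problems reliably and the other two never.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples drawn, c of them correct, budget k <= n."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 20
before = [4, 3, 2, 1]    # hypothetical: broad but unreliable coverage
after = [16, 12, 0, 0]   # hypothetical: sharp on common cases, rare ones lost

for k in (1, 10):
    print(k, round(mean_pass_at_k(before, N, k), 3), round(mean_pass_at_k(after, N, k), 3))
```

With these counts, pass@1 rises after self-improvement while pass@10 falls, because no sampling budget recovers the problems whose correct solutions were filtered out.
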

The non-overlap property of verification mechanisms (different verifiers catch different errors despite functional similarity) suggests that compositional verification, which combines multiple verification approaches, could substantially raise the ceiling. This is architecturally distinct from the temporal anchoring solution in "Why does self-rewarding training collapse when responses improve?": one fixes the preference signal, the other expands the verification surface.
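
A toy sketch of why non-overlapping verifiers compose well. It assumes a sorting task and two exact programmatic checks rather than model-based verifiers, so it illustrates only the non-overlap idea, not any setup from the cited work. All names and candidates are hypothetical.

```python
from collections import Counter


def is_ordered(inp, out):
    """Catches outputs that are out of order; blind to missing or extra items."""
    return all(a <= b for a, b in zip(out, out[1:]))


def is_permutation(inp, out):
    """Catches missing or extra items; blind to ordering errors."""
    return Counter(inp) == Counter(out)


def accept(inp, out, verifiers):
    """Compositional verification: accept only if every verifier passes."""
    return all(v(inp, out) for v in verifiers)


inp = [3, 1, 2]
candidates = {
    "correct": [1, 2, 3],
    "dropped_item": [1, 2],   # ordered, but not a permutation of the input
    "unsorted": [2, 1, 3],    # a permutation, but not ordered
}

results = {
    name: {
        "order_only": accept(inp, out, [is_ordered]),
        "perm_only": accept(inp, out, [is_permutation]),
        "combined": accept(inp, out, [is_ordered, is_permutation]),
    }
    for name, out in candidates.items()
}
for name, row in results.items():
    print(name, row)
```

Each single verifier wrongly accepts one flawed candidate, while the combination accepts only the correct one. Combining the checks widens the set of errors that can be filtered out.
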

Promptbreeder as a practical bound-pusher for prompt optimization: Promptbreeder (Fernando et al., 2023) shows one practical way to push against these bounds. It overcomes APE's "diminishing returns after three rounds" with a diversity-maintaining evolutionary algorithm in which mutation-prompts (instructions for modifying task-prompts) evolve alongside the task-prompts themselves, a form of self-referential self-improvement grounded in LLMs. Promptbreeder outperforms CoT and Plan-and-Solve on arithmetic and commonsense reasoning. However, the self-improvement is still bounded by the LLM's generation capability: mutation-prompts can only express modifications the model can articulate, and fitness evaluation depends on the model's own outputs. This makes Promptbreeder a concrete instantiation of the gap framework: the generation-verification gap determines the ceiling, and the evolutionary diversity mechanism delays the diversity collapse without eliminating it. Source: Prompts Prompting.
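
A minimal sketch of the co-evolution idea, under heavy assumptions. The LLM calls are replaced by deterministic string stubs, fitness is a keyword count rather than task accuracy, and the tournament rule (the loser is overwritten by a mutated copy of the winner) is a simplified reading of the approach, not Promptbreeder's actual implementation. Every name here is hypothetical.

```python
import random

# Hypothetical vocabulary of edits the stub "model" is able to articulate.
HINTS = ["step by step", "check your work", "state assumptions", "use examples"]
PREFIX = "Add the instruction: "


def fitness(task_prompt):
    """Stand-in for scoring a task-prompt on a training batch."""
    return sum(hint in task_prompt for hint in HINTS)


def mutate_task(mutation_prompt, task_prompt):
    """Stand-in for an LLM applying a mutation-prompt to a task-prompt."""
    hint = mutation_prompt.split(": ", 1)[1]
    return task_prompt if hint in task_prompt else f"{task_prompt} {hint}."


def evolve(population, steps, rng):
    """Binary tournaments over (task_prompt, mutation_prompt) pairs."""
    population = list(population)
    for _ in range(steps):
        i, j = rng.sample(range(len(population)), 2)
        if fitness(population[i][0]) < fitness(population[j][0]):
            i, j = j, i                      # i is the winner, j the loser
        task, mutation = population[i]
        if rng.random() < 0.3:               # the mutation-prompt itself evolves
            mutation = PREFIX + rng.choice(HINTS)
        population[j] = (mutate_task(mutation, task), mutation)
    return population


start = [
    ("Solve the problem.", PREFIX + HINTS[0]),
    ("Answer the question.", PREFIX + HINTS[1]),
    ("Solve the problem.", PREFIX + HINTS[2]),
    ("Think, then answer.", PREFIX + HINTS[3]),
]
final = evolve(start, steps=40, rng=random.Random(0))
best_before = max(fitness(task) for task, _ in start)
best_after = max(fitness(task) for task, _ in final)
print(best_before, best_after, len(HINTS))
```

The stub makes the ceiling explicit: fitness can never exceed `len(HINTS)`, because mutations can only add what the stand-in model can express. This mirrors the point that evolutionary search works inside the model's generation capability.
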

Empirical validation via evolutionary self-improvement (DGM): The Darwin Gödel Machine replaces formal self-improvement proofs with empirical validation: an evolutionary archive of past modifications, population-based search through code-level self-modifications, and fitness measured by benchmark performance. DGM improved its coding agent from 20.3% to 50.0% on SWE-bench Verified through iterative self-modification. This sidesteps the generation-verification gap by changing what "verification" means: instead of the model verifying its own outputs against a fixed standard, verification is empirical (does performance improve?) and historical (does the archive contain precedents?). The gap framework predicts this should work: empirical testing is a stronger verifier than self-evaluation, and evolutionary archives provide external reference points that prevent the diversity collapse that pure self-improvement suffers. See "Can AI systems improve themselves through trial and error?"
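
The archive-plus-empirical-scoring loop can be sketched as follows. This is a toy under stated assumptions, not DGM's implementation: an agent is modelled as a set of integer "skills", a self-modification toggles one skill, the benchmark is an invented list of task IDs, and parents are drawn uniformly from the archive, which simplifies whatever selection rule the real system uses.

```python
import random

BENCHMARK = [3, 7, 11, 15, 19]   # invented task IDs, for illustration only


def evaluate(agent):
    """Stand-in for a benchmark run: a task is solved if the agent has its skill."""
    return sum(task in agent for task in BENCHMARK) / len(BENCHMARK)


def self_modify(agent, rng):
    """Stand-in for a code-level self-modification: toggle one random skill."""
    child = set(agent)
    child.symmetric_difference_update({rng.randrange(20)})
    return frozenset(child)


def run(generations, seed=0):
    rng = random.Random(seed)
    archive = [(frozenset(), 0.0)]            # the initial agent and its score
    for _ in range(generations):
        # Any archived agent may become a parent, not only the current best,
        # so weaker lineages stay available as stepping stones.
        parent, _score = rng.choice(archive)
        child = self_modify(parent, rng)
        # Verification is empirical: the child is scored, nothing is proven.
        archive.append((child, evaluate(child)))
    return archive


archive = run(generations=30)
best_agent, best_score = max(archive, key=lambda entry: entry[1])
print(len(archive), best_score)
```

Because the archive only grows, earlier variants remain available as parents. This is the mechanism the paragraph above credits with providing external reference points.
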

The generator-discriminator-critique gap provides concrete evidence. Saunders et al. (2022) fine-tune large language models to write natural language critiques of model outputs. On topic-based summarization, model-written critiques help humans find flaws they would have missed. However, "we failed to find a clear trend showing critique performance catching up to discriminator performance, implying that larger models still have relevant knowledge they don't articulate as critiques." This is a direct instantiation of the generation-verification gap: the model can discriminate quality (verification) better than it can explain what's wrong (generation of critique). The gap persists at scale, suggesting it is structural rather than a matter of insufficient training. Source: Arxiv/Evaluations.

Original note title

self-improvement is bounded by the generation-verification gap — a formal quantity that scales with pretraining compute and vanishes for factual tasks