What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

"Mind the Gap" (Song et al., 2025) formalizes the precondition for self-improvement: the generation-verification gap, defined as the difference between a model's ability to verify solutions versus its ability to generate them. When this gap is positive, self-improvement has room to operate — the model can evaluate outputs better than it can produce them, creating a usable training signal.

The gap scales monotonically with pretraining FLOPs. Larger models have proportionally larger generation-verification gaps, which explains why self-improvement methods work better on larger models. For 4×4 Sudoku (an instance of a puzzle family whose generalized form is NP-complete to solve, while checking a filled grid takes polynomial time), only the largest models (72B+) show non-trivial gaps, with 50-300% accuracy improvement.

However, the gap vanishes for factual recall tasks. On Natural Questions, the gap is <1% or negative across all model sizes — verification provides no additional signal because knowing the answer and verifying the answer require the same factual knowledge. This predicts which tasks will benefit from self-improvement and which won't: tasks where generation is computationally harder than verification (math, code, structured problems) benefit; tasks where both require the same knowledge (factual QA) don't.
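
The contrast between the two task types can be sketched numerically. The snippet below is a simplified illustration, not the formal definition from Song et al.: it assumes the gap is measured as accuracy after keeping only answers the model itself scores above a threshold, minus plain generation accuracy. The `Sample` type, the threshold, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    correct: bool      # ground-truth correctness of one generated answer
    self_score: float  # the model's own verification score for that answer


def generation_accuracy(samples):
    """Fraction of generated answers that are correct."""
    return sum(s.correct for s in samples) / len(samples)


def verified_accuracy(samples, threshold=0.5):
    """Accuracy among answers the model's own verifier accepts."""
    kept = [s for s in samples if s.self_score >= threshold]
    if not kept:  # nothing survives filtering: fall back to the unfiltered set
        return generation_accuracy(samples)
    return sum(s.correct for s in kept) / len(kept)


def gap(samples, threshold=0.5):
    """Simplified generation-verification gap (assumed definition)."""
    return verified_accuracy(samples, threshold) - generation_accuracy(samples)


# Hypothetical reasoning-style task: self-scores carry information about correctness.
informative = [Sample(True, 0.9), Sample(False, 0.2),
               Sample(True, 0.7), Sample(False, 0.6)]

# Hypothetical factual-recall task: self-scores do not separate right from wrong.
uninformative = [Sample(True, 0.5), Sample(False, 0.5)]

print(gap(informative))    # positive: filtering raises accuracy
print(gap(uninformative))  # zero: verification adds no signal
```

When the verifier's scores track correctness, filtering raises accuracy and the gap is positive. When the scores are uninformative, as the note describes for factual recall, filtering changes nothing and the gap is zero.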

The diversity collapse finding is equally important: during iterative self-improvement, pass@k increases for small k (quality improves at the top) but decreases for large k (diversity decreases overall). The model converges on solutions it can verify, which are typically common patterns. Rare but correct solutions get filtered out because the model can't verify them. This is the entropy collapse dynamic operating through the verification bottleneck rather than through the policy directly.
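
The opposite movement of pass@k at small and large k can be made concrete with the unbiased pass@k estimator from Chen et al. (2021), 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that are correct. Whether the note's source computes pass@k this way is an assumption, and the per-problem counts below are invented purely for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 10  # samples per problem (hypothetical)

# Hypothetical counts of correct samples per problem.
before = [2, 2, 2, 2]    # every problem is occasionally solved
after = [10, 10, 0, 0]   # two problems always solved, two never solved

for k in (1, 10):
    print(k, mean_pass_at_k(before, N, k), mean_pass_at_k(after, N, k))
```

In this toy setting pass@1 rises from about 0.2 to 0.5 while pass@10 falls from 1.0 to 0.5: the model becomes reliable on problems whose solutions it can verify and loses the occasional correct answers on the rest.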

The non-overlap property of verification mechanisms (different verifiers catch different errors despite functional similarity) suggests that compositional verification, meaning the combination of multiple verification approaches, could substantially extend the ceiling. This is architecturally distinct from the temporal anchoring solution in "Why does self-rewarding training collapse when responses improve?": one fixes the preference signal, the other expands the verification surface.
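
A minimal sketch of compositional verification, assuming a toy task (sorting a list) and two hand-written checks that stand in for model-based verifiers. The function names and candidates are hypothetical; the point is only that each check catches an error the other misses, so their conjunction rejects more bad candidates than either alone.

```python
from collections import Counter


def is_sorted(inp, out):
    """Catches ordering errors, but not dropped or invented elements."""
    return all(a <= b for a, b in zip(out, out[1:]))


def is_permutation(inp, out):
    """Catches dropped or invented elements, but not ordering errors."""
    return Counter(inp) == Counter(out)


def compose(*verifiers):
    """Accept a candidate only if every verifier accepts it."""
    def accept(inp, out):
        return all(v(inp, out) for v in verifiers)
    return accept


task_input = [3, 1, 2]
candidates = [
    [1, 2, 3],  # correct
    [1, 2],     # sorted, but drops an element
    [3, 1, 2],  # a permutation, but not sorted
]

combined = compose(is_sorted, is_permutation)
for out in candidates:
    print(out, is_sorted(task_input, out), is_permutation(task_input, out),
          combined(task_input, out))
```

Only the first candidate passes the combined check. Each single verifier would have let one of the two faulty candidates through.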

Promptbreeder as a practical bound-pusher for prompt optimization: Promptbreeder (Fernando et al., 2023) pushes against these bounds in the specific setting of prompt optimization. It overcomes APE's "diminishing returns after three rounds" through a diversity-maintaining evolutionary algorithm in which mutation-prompts (instructions for modifying task-prompts) evolve alongside the task-prompts themselves, a form of self-referential self-improvement grounded in LLMs. Promptbreeder outperforms CoT and Plan-and-Solve on arithmetic and commonsense reasoning. However, the self-improvement is still bounded by the LLM's generation capability: mutation-prompts can only express modifications the model can articulate, and fitness evaluation depends on the model's own outputs. This makes Promptbreeder a concrete instantiation of the gap framework: the generation-verification gap determines the ceiling, and the evolutionary diversity mechanism delays the diversity collapse without eliminating it. Source: Prompts Prompting.

Empirical validation via evolutionary self-improvement (DGM): The Darwin Gödel Machine replaces formal self-improvement proofs with empirical validation: an evolutionary archive of past modifications, population-based search through code-level self-modifications, and fitness measured by benchmark performance. DGM improved its coding agent from 20.0% to 50.0% on SWE-bench Verified through iterative self-modification. This sidesteps the generation-verification gap by changing what "verification" means: instead of the model verifying its own outputs against a fixed standard, verification is empirical (does performance improve?) and historical (does the archive contain precedents?). The gap framework predicts this should work: empirical testing is a stronger verifier than self-evaluation, and evolutionary archives provide external reference points that prevent the diversity collapse that pure self-improvement suffers. See "Can AI systems improve themselves through trial and error?".
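
The archive-plus-benchmark loop described above can be sketched in a few lines. This is a toy under loud assumptions, not the DGM implementation: an "agent" is just a list of numbers, `benchmark` is a made-up scoring function standing in for a real benchmark, `mutate` stands in for a code-level self-modification, and parents are sampled uniformly from the archive (the selection rule here is a simplification chosen for brevity).

```python
import random


def benchmark(agent):
    """Hypothetical external score; higher is better. Stands in for a real benchmark."""
    return -sum((x - 0.7) ** 2 for x in agent)


def mutate(agent, rng):
    """Hypothetical self-modification: perturb one parameter of the agent."""
    child = list(agent)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0, 0.2)
    return child


def evolve(seed_agent, steps, rng):
    """Grow an archive of (agent, score) pairs; every variant is kept."""
    archive = [(seed_agent, benchmark(seed_agent))]
    for _ in range(steps):
        parent, _ = rng.choice(archive)  # sample from the whole archive, not only the best
        child = mutate(parent, rng)
        archive.append((child, benchmark(child)))  # verification is the measured score
    return archive


rng = random.Random(0)
archive = evolve([0.0, 0.0, 0.0], steps=200, rng=rng)
best_agent, best_score = max(archive, key=lambda pair: pair[1])
print(len(archive), archive[0][1], best_score)
```

Two design points mirror the paragraph: acceptance rests on a measured score rather than on the agent's opinion of its own output, and keeping weaker variants in the archive preserves starting points that a greedy keep-the-best loop would discard.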

The generator-discriminator-critique gap provides concrete evidence. Saunders et al. (2022) fine-tune large language models to write natural language critiques of model outputs. On topic-based summarization, model-written critiques help humans find flaws they would have missed. However, "we failed to find a clear trend showing critique performance catching up to discriminator performance, implying that larger models still have relevant knowledge they don't articulate as critiques." This is a direct instantiation of the generation-verification gap: the model can discriminate quality (verification) better than it can explain what's wrong (generation of critique). The gap persists at scale, suggesting it is structural rather than a matter of insufficient training. Source: Arxiv/Evaluations.

Related concepts in this collection

Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
specific instance: single-model self-revision collapses when the generation-verification gap is narrow

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
diversity collapse during self-improvement mirrors entropy collapse during RL; the mechanism differs (verification filtering vs policy concentration) but the outcome is the same

How quickly do errors compound during model self-training?
When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
a related but distinct iterative failure mode; error avalanching is about error accumulation, the gap framework is about verification ceilings

Why does self-rewarding training collapse when responses improve?
Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?
gradient collapse is one consequence of a narrowing generation-verification gap

Can AI systems improve themselves through trial and error?
Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
empirical validation + evolutionary archives sidestep the formal gap by changing what verification means

Can LLMs understand concepts they cannot apply?
Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
Potemkin understanding is a qualitative manifestation of a positive generation-verification gap: the model verifies/explains better than it generates/applies, and this disconnect is exactly what makes self-improvement possible on those tasks

Why do language models fail to act on their own reasoning?
LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
the knowing-doing gap (87% correct rationales, 64% correct actions) quantifies the generation-verification gap in sequential decision-making: the model's verification ability (rationale generation) exceeds its generation ability (action selection)