Why does optimizing only quality cause model collapse in self-improvement loops?
This explores why a self-improvement loop that selects only for higher quality — keeping the best outputs and retraining on them — tends to eat its own diversity and degrade, rather than just getting better.
The short version the corpus suggests: quality is only half of what keeps a loop healthy. When you reward only "better outputs," you quietly punish variety, and variety is the fuel the loop runs on. The clearest articulation is the idea that pure self-improvement is structurally circular (Can models reliably improve themselves without external feedback?): a model that trains on its own quality-filtered outputs narrows toward the modes it already favors. The note calls this failure diversity collapse, and it often arrives alongside reward hacking, where the model learns to satisfy the quality signal rather than the underlying goal.
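The narrowing can be illustrated with a toy simulation. This is a minimal sketch, not something taken from the notes: it assumes the model's outputs are a one-dimensional Gaussian, that "quality" is simply the sampled value, and that "retraining" means refitting the Gaussian to the kept samples. The function name `run_loop` and all parameters are hypothetical.

```python
import random
import statistics


def run_loop(rounds=5, pool_size=2000, keep_frac=0.2, seed=0):
    """Toy quality-only loop: sample, keep the top-scoring outputs, refit."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # hypothetical 1-D "output distribution" of the model
    history = [(mean, std)]
    for _ in range(rounds):
        samples = [rng.gauss(mean, std) for _ in range(pool_size)]
        # Assumption: the quality score is just the sample value itself.
        kept = sorted(samples, reverse=True)[: int(pool_size * keep_frac)]
        # "Retrain" by refitting the Gaussian to the kept samples only.
        mean, std = statistics.fmean(kept), statistics.pstdev(kept)
        history.append((mean, std))
    return history


history = run_loop()
for i, (m, s) in enumerate(history):
    print(f"round {i}: mean={m:.2f} std={s:.3f}")
```

In this toy the mean (the quality proxy) rises every round while the spread shrinks toward zero, so each round has less variety left to select from. Real training loops are far messier, but the direction of the effect is the one the paragraph above describes.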
There's a deeper reason quality-only optimization is unreliable, beyond narrowing: a model can only improve itself where it can verify better than it can generate. That generation–verification gap is described as a formal ceiling on self-improvement (What limits how much models can improve themselves?; What stops large language models from improving themselves?). If your quality filter is itself the model's own judgment, you're optimizing against a flawed ruler. Push hard on it and you amplify the ruler's blind spots. This is why the reliable methods, as one synthesis puts it, all "smuggle in" something external (What actually constrains large language models from self-improvement?): a past model version, a third-party judge, user corrections, or tool feedback that the loop can't fake.
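The "flawed ruler" point can be sketched as best-of-n selection with a noisy verifier. This is an illustrative toy under stated assumptions, not the formal result the notes refer to: candidate quality is drawn from a standard Gaussian, the verifier sees that quality plus its own Gaussian error, and `selected_quality` and `verifier_noise` are hypothetical names.

```python
import random
import statistics


def selected_quality(verifier_noise, n_candidates=8, trials=4000, seed=0):
    """Average true quality of the candidate a noisy verifier picks."""
    rng = random.Random(seed)
    picks = []
    for _ in range(trials):
        true_q = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
        # The verifier scores each candidate as true quality plus its own error.
        scored = [(q + rng.gauss(0.0, verifier_noise), q) for q in true_q]
        # Keep the candidate with the highest verifier score.
        picks.append(max(scored)[1])
    return statistics.fmean(picks)


sharp = selected_quality(verifier_noise=0.1)   # verifier much better than chance
blunt = selected_quality(verifier_noise=10.0)  # verifier barely tracks quality
print(f"sharp verifier: {sharp:.2f}, blunt verifier: {blunt:.2f}")
```

When the verifier's error is small, selection picks candidates well above the average of zero. When the error swamps the signal, the selected candidates are barely better than a random draw, which is the sense in which filtering can only help as far as verification outruns generation.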
The diversity side has a concrete mechanism worth knowing. Preference tuning's effect on diversity isn't uniform: it reduces lexical-syntactic diversity in code, where the reward favors convergence toward correct solutions, but can actually increase it in creative writing, where stylistic distinctiveness is rewarded (Does preference tuning always reduce diversity the same way?). So "optimize only quality" is most corrosive exactly where quality looks like convergence: the loop keeps narrowing toward one answer-shape and loses the spread of attempts it needs to discover anything new. Relatedly, optimizing a crude quality signal can distort the model in ways that aren't about correctness at all. Binary correct/wrong rewards degrade calibration, training the model to guess confidently because confident wrong answers aren't penalized (Does binary reward training hurt model calibration?).
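The calibration point can be checked with a few lines of arithmetic. This is a sketch under assumptions: the source note says a Brier score is added as a second reward term, and the exact form used here (correctness reward minus the squared gap between stated confidence and outcome) is one plausible reading, not a confirmed formula. The function names and the value `p = 0.6` are hypothetical.

```python
def expected_binary_reward(p_correct, confidence):
    """Expected reward when only correctness is scored.

    The stated confidence is deliberately unused: a binary reward
    never looks at it, so overconfidence costs nothing.
    """
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0


def expected_brier_reward(p_correct, confidence):
    """Expected correctness reward minus the Brier penalty (confidence - outcome)^2."""
    brier = p_correct * (confidence - 1.0) ** 2 + (1.0 - p_correct) * confidence ** 2
    return p_correct - brier


p = 0.6  # hypothetical true chance that the answer is right
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda q: expected_brier_reward(p, q))
print(best)  # 0.6
```

Under the binary reward the expected payoff is identical at any stated confidence, so nothing pushes back against claiming certainty. With the Brier term, the expected penalty is q² − 2pq + p, which is smallest at q = p, so the best-paying confidence is the true probability of being right.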
What's striking, and probably the thing you didn't know you wanted to know, is that the loops that *don't* collapse are the ones engineered to preserve disagreement or hard correctness rather than self-rated quality. Asymmetric self-play survives because a proposer is rewarded for generating *calibrated, varied* problems while the solver learns from majority-vote verification, so the system manufactures its own diversity instead of consuming it (Can language models improve themselves without any external training data?). Transformers trained on their own correct solutions extend from 10-digit to 100-digit addition, improving exponentially across rounds, because they filter on *verifiable* correctness (does this 100-digit sum check out?), not a soft quality score (Can transformers improve exponentially by learning from their own correct solutions?). And there's a second-order trap: once a model's own earlier errors fill the context, they degrade later steps non-linearly (Do models fail worse when their own errors fill the context?). A quality-only loop that feeds on its own long trajectories therefore doesn't just stop improving; it can actively poison itself.
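The two verification styles mentioned above, majority vote over sampled answers and an exact correctness check, can be sketched as follows. This is a minimal illustration, not the implementation from either source note: the helper names, the agreement threshold, and the sample answers are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def keep_if_confident(answers, threshold=0.5):
    """Keep the voted answer as a pseudo-label only if agreement exceeds threshold.

    The threshold is an assumption added for illustration.
    """
    answer, agreement = majority_vote(answers)
    return answer if agreement > threshold else None


def check_sum(a, b, claimed):
    """Hard verification: exact integer arithmetic, no quality score involved."""
    return a + b == claimed


# Five hypothetical sampled answers to 123 + 456.
samples = [579, 579, 580, 579, 569]
print(keep_if_confident(samples))  # 579
print(check_sum(123, 456, 579))    # True
```

Neither check asks the model how good it thinks an answer is. The vote relies on agreement across independent samples, and the sum check relies on arithmetic that the model cannot talk its way around, which is what makes both harder to game than a self-rated score.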
The through-line: collapse isn't caused by optimizing quality per se. It's caused by optimizing quality *alone*, using the model's own judgment as the standard. Healthy loops pair a quality signal with something it can't game: external verification, an explicit diversity or calibration term, or genuinely checkable correctness.
Sources (9 notes)
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.