Can co-evolved critics truly circumvent static evaluator limitations in self-improvement?

This explores whether critics that evolve in lockstep with the model they judge can escape the limits that doom a fixed evaluator — and the corpus suggests they shift the limit rather than abolish it.

The real question is whether a moving target (a critic trained alongside the generator instead of frozen in place) can dodge the wall that static evaluators hit in self-improvement. The corpus says co-evolution genuinely buys you something, but not escape from the underlying physics. The cleanest statement of those physics is the generation-verification gap (What limits how much models can improve themselves?): a model can only improve itself to the degree it can judge an answer better than it can produce one. A critic that shares the generator's weights, biases, and blind spots doesn't widen that gap. On factual tasks the gap collapses to nothing, so no critic, evolving or not, has anything extra to offer.
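One way to make the generation-verification gap concrete is to compare how often a randomly drawn sample is correct with how often the verifier's top-ranked sample is correct. The sketch below is only an illustrative operationalization, not the measurement used in the cited note: the function name `generation_verification_gap` and the `(is_correct, verifier_score)` data layout are assumptions made for this example.

```python
from statistics import mean


def generation_verification_gap(problems):
    """Estimate the gap between picking a sample at random and picking by verifier.

    `problems` is a list of problems; each problem is a list of candidate
    tuples (is_correct, verifier_score). This layout is hypothetical.
    Returns (generation_accuracy, verified_accuracy, gap).
    """
    # Generation accuracy: chance that a randomly drawn sample is correct.
    gen_acc = mean(
        mean(1.0 if ok else 0.0 for ok, _ in cands) for cands in problems
    )
    # Verified accuracy: correctness of the candidate the verifier scores highest.
    ver_acc = mean(
        1.0 if max(cands, key=lambda c: c[1])[0] else 0.0 for cands in problems
    )
    return gen_acc, ver_acc, ver_acc - gen_acc


# A verifier that separates right from wrong yields a positive gap.
informative = [[(True, 0.9), (False, 0.1)], [(False, 0.2), (True, 0.8)]]
# A verifier with no extra information (identical scores) yields no gap.
uninformative = [[(False, 0.5), (True, 0.5)], [(True, 0.5), (False, 0.5)]]
```

Under this toy definition, a critic that shares all of the generator's blind spots behaves like the second case: its scores carry no information beyond what sampling already provides, so reranking by it buys nothing.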

Where co-evolved critics clearly do help is in keeping the search alive. Static training loops tend to collapse: solutions narrow, diversity dies, and the model converges prematurely on its own confident habits. A critic embedded in the training loop counteracts exactly this, preserving exploration diversity rather than just nudging up test accuracy (Do critique models improve diversity during training itself?). Systems like SERL push further, alternating a model between generator and judge roles and deriving reward from the consistency of its own rankings, which lifted its AlpacaEval win rate from 52.37% to 59.90% with no external signal at all (Can models learn to judge themselves without external rewards?). And when numerical critics plateau, swapping them for critics that explain *why* an answer failed, with natural-language critique instead of a scalar, breaks through ceilings that more scaling couldn't (Can natural language feedback overcome numerical reward plateaus?).
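To illustrate what "reward from the consistency of its own rankings" could look like, here is a minimal sketch. It is not SERL's actual reward formula, which the notes do not spell out: the function `consistency_rewards`, the `judge(i, j)` callback, and the rule of discarding verdicts that flip when presentation order is swapped are all assumptions chosen for this example.

```python
from itertools import combinations


def consistency_rewards(n, judge):
    """Turn pairwise self-judgments into per-response rewards.

    `judge(i, j)` is a hypothetical callback returning True if the model,
    shown responses i then j, prefers the first one shown (i).
    Returns (rewards, consistency): win fractions per response, and the
    share of pairs whose verdict survives swapping presentation order.
    """
    wins = [0.0] * n
    consistent = 0
    pairs = list(combinations(range(n), 2))
    for i, j in pairs:
        first = judge(i, j)        # True means i is preferred
        second = not judge(j, i)   # swapped order; True again means i is preferred
        if first == second:
            consistent += 1
            wins[i if first else j] += 1.0
        # Verdicts that flip with order contribute no reward to either side.
    denom = max(n - 1, 1)
    rewards = [w / denom for w in wins]
    return rewards, (consistent / len(pairs) if pairs else 1.0)


# A judge that tracks a hidden quality score is fully consistent.
quality = [0.1, 0.9, 0.5]
stable = consistency_rewards(3, lambda i, j: quality[i] > quality[j])
# A judge that always prefers whichever response is shown first earns nothing.
biased = consistency_rewards(3, lambda i, j: True)
```

The design choice worth noting is that consistency filters out some judge noise (here, position bias) without adding any outside information: a judge that is consistently wrong would still be rewarded.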

But here's the catch the corpus keeps returning to: the methods that actually work tend to smuggle external anchors back in. Pure self-improvement is circular and stalls on diversity collapse and reward hacking; the reliable recipes quietly import past model versions, third-party judges, user corrections, or tool feedback (Can models reliably improve themselves without external feedback?). AlphaLLM's critics look self-contained, but their signal comes from tree-search *outcomes*, a structure that ranks paths by success, not from the model grading itself in a vacuum (Can tree search replace human feedback in LLM training?). The Darwin Gödel Machine reaches open-ended improvement precisely by swapping formal proofs of benefit for empirical benchmarking against the world, rather than trusting the model's own self-assessment (Can AI systems improve themselves through trial and error?). The common thread: the critic that escapes static limits is usually the one wired to something outside the model's own judgment.
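The "external anchor" pattern can be sketched as an archive loop in which every variant is scored by a benchmark the model does not control. This is a toy illustration loosely inspired by the description of the Darwin Gödel Machine in the notes (an evolutionary archive plus empirical benchmarking), not its implementation: `evolve_with_external_benchmark`, `mutate`, and `benchmark` are hypothetical names, and uniform parent sampling is an assumption.

```python
import random


def evolve_with_external_benchmark(seed, mutate, benchmark, steps, rng):
    """Grow an archive of variants scored only by an external benchmark.

    `mutate(parent, rng)` proposes a new variant; `benchmark(variant)`
    returns a score measured outside the model. Both are hypothetical.
    Returns the best (score, variant) pair found.
    """
    archive = [(benchmark(seed), seed)]
    for _ in range(steps):
        _, parent = rng.choice(archive)   # any archived variant can be a parent
        child = mutate(parent, rng)
        # Keep every variant, even weaker ones, so exploration stays open.
        archive.append((benchmark(child), child))
    return max(archive, key=lambda entry: entry[0])


# Toy usage: variants are integers, and the benchmark prefers values near 5.
rng = random.Random(0)
best_score, best_variant = evolve_with_external_benchmark(
    seed=0,
    mutate=lambda x, r: x + r.choice([-1, 1]),
    benchmark=lambda x: -abs(x - 5),
    steps=200,
    rng=rng,
)
```

The point of the sketch is where the score comes from: `benchmark` stands outside the loop, so selection pressure cannot be satisfied by the variant merely rating itself highly.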

There's also a deeper objection to the whole framing. Even a co-evolving critic, if humans designed its evaluation loop, is still *extrinsically* fixed: its metacognitive strategy doesn't adapt when the domain shifts. True circumvention, on this view, would require the agent to generate its own evolving evaluation criteria, not just an evolving score under a fixed rubric (Can AI systems improve their own learning strategies?). The same worry runs through the alignment literature: metacognition has to be externalized rather than assumed-learned, because models whose values grow coherent at scale also acquire problematic self-valuation (What actually constrains large language models from self-improvement?). Promising middle-ground work like Post-Completion Learning teaches models to internalize evaluation in unused sequence space at zero inference cost (Can models learn to evaluate their own work during training?), but internalizing an evaluator is not the same as outgrowing the gap that limits it.

So the honest answer: co-evolved critics circumvent the failure modes of *static* evaluators (staleness, premature convergence, uninformative scalar rewards) without circumventing the deeper bound that any verifier sharing the generator's blind spots inherits its ceiling. The cautionary tale is imitation training, where a model can mimic a stronger model's confident style and fool evaluators while closing essentially none of the capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?). A critic that evolves toward what *looks* good rather than what *is* good doesn't escape the limit; it hides it. The takeaway: the question isn't whether your critic is static or co-evolving, but whether it has access to information your generator doesn't.

Sources (11 notes)

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can AI systems improve their own learning strategies?

Current self-improvement methods use extrinsic, fixed metacognitive loops designed by humans that fail under domain shift or capability changes. True self-improvement requires agents to generate their own adaptive metacognitive knowledge, planning, and evaluation—a gap confirmed as a neglected research area across neuro-symbolic AI.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
