Can critique-only calls in LLMs exploit a measurable gap between generation and evaluation?
This explores whether prompting a model to only critique or evaluate — rather than generate — taps into a real, measurable difference between how well LLMs produce answers versus how well they judge them (the generation-verification gap), and whether that gap is something you can actually lean on.
This reads the question as: is there a structural gap between generating and evaluating in LLMs, and can a critique-only pass exploit it? The corpus says the gap is real, formal, and load-bearing, but it cuts both ways. The clearest anchor is the finding that self-improvement is formally bounded by the generation-verification gap (What stops large language models from improving themselves?): a model can't reliably fix its own output through metacognition alone, because every dependable correction needs something external to validate and enforce it. That framing flips the usual intuition. The gap isn't a free lunch you can mine by simply re-prompting the same model to critique. If verification were both strictly easier than generation and internally trustworthy, models would bootstrap past their ceiling; they don't. So a critique-only call exploits the gap only to the degree that the evaluation signal carries information the generation pass didn't already encode.
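To make the asymmetry concrete, here is a minimal Python sketch that is not drawn from the corpus. It uses integer factorization as a stand-in task where checking a candidate is cheap and independent of how the candidate was produced. The function names and the candidate list are hypothetical; the candidates stand in for the outputs of a generation pass, and the verifier plays the role of the external check.

```python
from typing import Iterable, Optional, Tuple

def verify_factorization(n: int, factors: Tuple[int, ...]) -> bool:
    """External check: deterministic and independent of the generator."""
    product = 1
    for f in factors:
        if f <= 1:  # reject trivial or invalid factors
            return False
        product *= f
    return product == n

def select_verified(n: int,
                    candidates: Iterable[Tuple[int, ...]]) -> Optional[Tuple[int, ...]]:
    """Return the first candidate that passes the external check, else None."""
    for candidate in candidates:
        if verify_factorization(n, candidate):
            return candidate
    return None

# Hypothetical outputs of a generation pass; only one is actually correct.
candidates = [(87, 99), (89, 97), (91, 95)]
print(select_verified(8633, candidates))
```

The verifier adds information here because it computes something the candidates did not come with: the actual product. A critique pass that merely re-reads the candidates with the same beliefs that produced them would have no such independent signal.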
Why might evaluation still be genuinely easier than generation? Because generation is constrained by the architecture in ways judging is not. Token production is a smooth probabilistic flow toward the training distribution: it doesn't explore competing claims or counterpositions while writing (Does LLM generation explore competing claims while producing text?), and it unfolds sequentially without any reflective pause that could change what comes next (Does AI text generation unfold through temporal reflection?). Most sharply, autoregressive generation lacks a retraction primitive: once a token is emitted it can't be discarded, and discarding is exactly the operation constraint-solving and error-correction depend on (Why does autoregressive generation fail at constraint satisfaction?). A critique-only call sidesteps that. It reads a finished artifact and is free to point at the bad token without having had to avoid emitting it in real time. That's the mechanism by which the gap becomes measurable and, in principle, exploitable.
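The role of retraction can be illustrated with a toy constraint problem. The sketch below is an analogy under stated assumptions, not a model of a transformer: `greedy_no_retraction` is a hypothetical stand-in for a left-to-right process that commits to each choice permanently, while `backtracking` is a standard solver that may undo a choice. Both place queens row by row on an n-by-n board.

```python
from typing import List, Optional

def safe(placed: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks none already placed."""
    row = len(placed)
    return all(c != col and abs(c - col) != row - r
               for r, c in enumerate(placed))

def greedy_no_retraction(n: int) -> Optional[List[int]]:
    """Commit to the first safe column per row; never undo a choice."""
    placed: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if safe(placed, c)), None)
        if choice is None:
            return None  # stuck, and there is no way to revise earlier rows
        placed.append(choice)
    return placed

def backtracking(n: int, placed: Optional[List[int]] = None) -> Optional[List[int]]:
    """Same search order, but an invalid partial assignment can be discarded."""
    placed = [] if placed is None else placed
    if len(placed) == n:
        return list(placed)
    for c in range(n):
        if safe(placed, c):
            placed.append(c)
            result = backtracking(n, placed)
            if result is not None:
                return result
            placed.pop()  # the retraction step the greedy version lacks
    return None

print(greedy_no_retraction(4))  # None
print(backtracking(4))          # [1, 3, 0, 2]
```

On the 4-queens instance the committed process dead-ends at the third row, while the solver recovers by discarding its first two placements. The single `placed.pop()` line is the whole difference between the two functions.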
But the same architecture limits how far critique can go. The model that evaluates is the model that generated: its reasoning lives in latent-state trajectories, and the surface text (including a critique) is only a partial interface onto that hidden process (Where does LLM reasoning actually happen during generation?). When a task actually requires iterative refinement, models tend to pattern-match a plausible-looking answer rather than run the procedure (Do large language models actually perform iterative optimization?), and errors in long delegated chains compound silently without ever plateauing (Do frontier LLMs silently corrupt documents in long workflows?). A critique pass inherits those same blind spots. So the gap is exploitable for catching surface-detectable defects, not for transcending the model's underlying competence, which is precisely why the note What stops large language models from improving themselves? insists the binding fix has to come from outside.
The corpus does show one concrete place where critique becomes a usable signal rather than a self-grading loop: transforming a critique into a different representation. Few-shot prompting can convert negative feedback ("doesn't look good for a date") into a positive, retrievable preference ("prefer more romantic"), which then drives a downstream retrieval system (Can language models bridge the gap between critique and preference?). That's the productive shape of "critique-only": not the model judging its own answer for correctness, but critique being routed into an external mechanism that does the enforcing the model can't. The takeaway you might not have expected: the generation-evaluation gap is most worth exploiting when the critique leaves the model and lands in a system with the retraction, validation, or grounding the generator structurally lacks.
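A rough sketch of that routing, under loud assumptions: the corpus does not specify a prompt format, a model API, or a retriever, so everything below is hypothetical. The model call is injected as a plain callable (`complete`), the second few-shot pair is invented for illustration, and retrieval is a toy word-overlap ranking standing in for a real retrieval system.

```python
from typing import Callable, List

# First pair is the example from the text; the second is a hypothetical addition.
FEW_SHOT = [
    ("doesn't look good for a date", "prefer more romantic"),
    ("too loud to talk", "prefer quieter"),
]

def build_prompt(critique: str) -> str:
    """Assemble a few-shot prompt that ends where the model should continue."""
    parts = ["Rewrite each critique as a positive preference."]
    for negative, positive in FEW_SHOT:
        parts.append(f"Critique: {negative}\nPreference: {positive}")
    parts.append(f"Critique: {critique}\nPreference:")
    return "\n\n".join(parts)

def critique_to_preference(critique: str, complete: Callable[[str], str]) -> str:
    """`complete` is any text-completion function supplied by the caller."""
    return complete(build_prompt(critique)).strip()

def retrieve(preference: str, items: List[str]) -> List[str]:
    """Toy retriever: rank items by words shared with the preference."""
    terms = set(preference.lower().split())
    scored = [(len(terms & set(item.lower().split())), item) for item in items]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable for ties
    return [item for score, item in ranked if score > 0]

# Stub in place of a real model, so the sketch runs without any API.
def fake_llm(prompt: str) -> str:
    return " prefer more romantic"

items = ["romantic candlelit bistro", "noisy sports bar", "quiet romantic rooftop"]
preference = critique_to_preference("not great for an anniversary", fake_llm)
print(retrieve(preference, items))
```

The design point is that the model's output is never graded by the model. It is handed to `retrieve`, a separate mechanism whose results can be inspected and validated on their own terms.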
Sources (8 notes)
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.