Why do models trained on critique fail at self-critique despite strong other-model evaluation?

This explores a specific asymmetry: a model can be sharp at judging another model's work, yet stumble when turning that same critical eye on its own output — and asks why the skill doesn't transfer inward.

The corpus points to a single underlying culprit: the model isn't missing the skill, it's fighting a bias toward trusting whatever it generated. One study found that LLMs systematically over-trust their own answers because a high-probability generated answer simply *feels* more correct during evaluation, a self-agreement loop that has nothing to do with whether the answer is right (see "Why do models trust their own generated answers?"). The critique competence is still present; it just gets overridden by the model's prior commitment to its own text.

That's why the failure compounds rather than corrects. When a model revises based only on its own previous reasoning, it tends to grow *more* confident in errors, not less: a distinct failure mode where self-revision amplifies wrong answers instead of catching them (see "Does a model improve by arguing with itself?"). The fix in that work is telling: genuine disagreement from a *different* model reverses the pattern and improves both accuracy and calibration. The variable that matters isn't 'can it critique' but 'is the thing being critiqued its own.' Other-model evaluation works precisely because the model has no stake in the other model's output.

There's also a training-data reason the inward version breaks. Teaching self-correction by fine-tuning on offline correction traces fails because the errors in the training data don't match the errors the model actually makes at test time, and the model collapses into a single rote correction mode (see "Why does self-correction training on offline data fail?"). So even a model explicitly trained to critique and fix can end up critiquing a distribution of mistakes it never makes: strong on paper, useless on its own live errors. The repair was multi-turn online RL on the model's *own* mistakes, which is another way of grounding the critique in something the model can't pre-commit to.

The deeper frame is that pure self-evaluation is structurally circular. Reliable self-improvement methods that look like they run on internal signal almost always smuggle in an external anchor: a past model version, a third-party judge, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). Other-model evaluation *is* that external anchor; self-critique removes it and leaves the model marking its own homework. The methods that succeed at self-judgment work by manufacturing distance. SERL has the model alternate between author and judge and derives reward from ranking *consistency* rather than self-approval (see "Can models learn to judge themselves without external rewards?"), and in-training critique keeps solution diversity alive so the model doesn't prematurely converge on its own first guess (see "Do critique models improve diversity during training itself?").
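To make "reward from ranking consistency rather than self-approval" concrete, here is a minimal toy sketch. It is not SERL's actual formulation, which the notes here do not spell out. It assumes a hypothetical pairwise `judge(first, second)` callable that returns `True` when it prefers the response shown first, and it only credits a win when the verdict survives swapping the presentation order. The function name `consistency_rewards` and the normalization are illustrative choices.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# returns True if the response shown FIRST is preferred.
Judge = Callable[[str, str], bool]


def consistency_rewards(
    responses: Sequence[str], judge: Judge
) -> Tuple[List[float], float]:
    """Return (per-response rewards, share of order-consistent pairs).

    A pair counts as consistent when the same response wins regardless
    of which one is shown first. Only consistent verdicts earn reward.
    """
    n = len(responses)
    wins = [0.0] * n
    consistent = 0
    pairs = list(combinations(range(n), 2))

    for i, j in pairs:
        forward = judge(responses[i], responses[j])   # i shown first
        backward = judge(responses[j], responses[i])  # j shown first
        # forward=True means i won; backward=True means j won.
        # The verdicts agree on a winner exactly when they differ.
        if forward != backward:
            consistent += 1
            wins[i if forward else j] += 1.0
        # Otherwise the verdict flipped with order (or tied): no reward.

    denom = max(n - 1, 1)  # each response appears in n - 1 pairs
    rewards = [w / denom for w in wins]
    consistency = consistent / len(pairs) if pairs else 1.0
    return rewards, consistency
```

With a stand-in judge that simply prefers the longer string, `consistency_rewards(["a", "bbb", "cc"], lambda a, b: len(a) > len(b))` gives rewards `[0.0, 1.0, 0.5]` and a consistency of `1.0`. A judge that always picks whatever it sees first earns every response a reward of zero, which is the point: the signal comes from agreement across judgments, not from the judge approving a particular answer.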

The thing worth carrying away: critique is genuinely a deeper skill than imitation. Training on critiquing flawed answers builds more real understanding than copying correct ones (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). The skill is real and it's learned. What doesn't transfer is *objectivity*, because objectivity was never a property of the model; it was a property of the gap between judge and judged. Close that gap and the same competent critic becomes its own most credulous fan.

Sources (7 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached a 59.90% win rate without external signals, up from 52.37%.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about self-critique failure in LLMs. The precise question: Why does critique competence not transfer from evaluating others' outputs to evaluating one's own?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2025; key constraints emerged 2023–2024:
• Models systematically over-trust their own generated text due to high probability during evaluation, not skill deficit (2024).
• Self-revision amplifies errors rather than catching them; multi-model disagreement reverses this (2024).
• Training on offline correction traces fails due to distribution mismatch between training errors and test-time errors (2024–2025).
• Pure self-improvement is circular; every reliable method smuggles in external anchors—past versions, third-party judges, tool feedback (2025).
• Critique skill is learnable and deeper than imitation (2025), but objectivity depends on judge–judged distance, not model competence alone.

Anchor papers (verify; mind their dates):
• arXiv:2403.09972 (2024-03): Self-Detection via comprehensiveness checks.
• arXiv:2409.12917 (2024-09): Self-Correction via RL.
• arXiv:2501.17703 (2025-01): Critique Fine-Tuning vs. imitation.
• arXiv:2508.03682 (2025-08): Self-Questioning Language Models.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, assess whether newer models (GPT-4o, Claude 3.5, Llama 3.3), training methods (RLHF refinements, online RL at scale), tooling (critique harnesses, multi-turn orchestration, cached context), or evals since Aug 2025 have relaxed it. Distinguish the durable question (likely still open: *How can models achieve genuine self-evaluation without external anchors?*) from perishable limitations (possibly resolved by: larger models, longer horizons, ensemble tricks, retrieval grounding). State plainly where each constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially papers claiming self-improvement *without* external signal, or showing critique transfer via architectural innovation.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can sufficiently long context + iterative internal disagreement (model vs. cached-self) manufacture the distance needed? (b) Does scaling critique models to 100B+ parameters overcome the commitment bias, or does the bias scale too?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
