Does critiquing errors teach deeper understanding than imitating correct answers?
Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
Supervised Fine-Tuning (SFT) trains models to maximize the probability of a correct response given an instruction. Critique Fine-Tuning (CFT) trains models to maximize the probability of a high-quality critique given an instruction plus a noisy (flawed) response. The training objective is P(critique | query, flawed_response). At inference time, the trained model generates direct responses in the normal way; no critique step is invoked.
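The difference between the two objectives shows up in how a single training example is assembled. The following is a minimal sketch, assuming a plain prompt/target text format; the prompt templates, the `TrainingExample` type, and the `build_sft_example`, `build_cft_example`, and `loss_mask` helpers are illustrative assumptions, not the format used in the CFT paper. In both cases the loss would be taken only on the target tokens.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    prompt: str  # conditioning context; no loss is taken on these tokens
    target: str  # text whose likelihood is maximized


def build_sft_example(query: str, correct_response: str) -> TrainingExample:
    """SFT: maximize P(correct_response | query)."""
    # Hypothetical template, chosen only for illustration.
    return TrainingExample(
        prompt=f"Question: {query}\nAnswer:",
        target=" " + correct_response,
    )


def build_cft_example(query: str, flawed_response: str, critique: str) -> TrainingExample:
    """CFT: maximize P(critique | query, flawed_response)."""
    # The flawed response is part of the context, never part of the target.
    prompt = (
        f"Question: {query}\n"
        f"Candidate answer: {flawed_response}\n"
        "Critique:"
    )
    return TrainingExample(prompt=prompt, target=" " + critique)


def loss_mask(prompt_tokens: list[str], target_tokens: list[str]) -> list[int]:
    """Return 1 for tokens that contribute to the loss, 0 for context-only tokens."""
    return [0] * len(prompt_tokens) + [1] * len(target_tokens)


if __name__ == "__main__":
    example = build_cft_example(
        "What is 2 + 2?", "5", "The addition is wrong; 2 + 2 equals 4."
    )
    # Whitespace splitting stands in for a real tokenizer here.
    mask = loss_mask(example.prompt.split(), example.target.split())
    print(example.prompt)
    print(example.target)
    print(mask)
```

The only structural change from SFT to CFT in this sketch is what sits on each side of the conditioning bar: the flawed response moves into the prompt, and the critique replaces the correct answer as the target.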
The advantage is mechanistic: to write a good critique, the model must understand the problem at a structural level — not just recognize the correct answer pattern but identify precisely what is wrong with a given response and why. This requires engaging with failure modes, understanding the criteria for correctness, and reasoning about deviations from those criteria. SFT can succeed by learning to recognize the surface form of correct answers. CFT cannot succeed by surface matching alone.
The training data is cheap to generate: GPT-4o produces critiques for (query, noisy response) pairs at scale. The cost is that at least 20% of those critiques contain errors, an acknowledged limitation. But even imperfect critique supervision outperforms correct-response imitation, which reveals how weak the imitation objective is at building understanding.
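The data generation step can be sketched as a loop over (query, noisy response) pairs, with a teacher model supplying each critique. This is a sketch under stated assumptions: `Teacher` is a hypothetical callable standing in for a GPT-4o call, `build_cft_dataset` and the record field names are illustrative, and `toy_teacher` exists only so the example runs. Skipping empty teacher outputs is also an assumption of this sketch. No filtering of erroneous critiques is shown, in keeping with the point that the supervision is imperfect.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Hypothetical teacher interface: (query, noisy_response) -> critique text.
Teacher = Callable[[str, str], str]


def build_cft_dataset(
    pairs: Iterable[tuple[str, str]], teacher: Teacher
) -> list[dict[str, str]]:
    """Attach a teacher-written critique to every (query, noisy_response) pair."""
    dataset = []
    for query, noisy_response in pairs:
        critique = teacher(query, noisy_response)
        if not critique.strip():
            continue  # assumption: drop pairs where the teacher returned nothing
        dataset.append(
            {
                "query": query,
                "noisy_response": noisy_response,
                "critique": critique,
            }
        )
    return dataset


def toy_teacher(query: str, noisy_response: str) -> str:
    """Stand-in for a real model call; returns a templated critique."""
    return f"The response '{noisy_response}' to '{query}' should be checked step by step."


if __name__ == "__main__":
    pairs = [("What is 2 + 2?", "5"), ("Capital of France?", "Lyon")]
    for record in build_cft_dataset(pairs, toy_teacher):
        print(record)
```

Passing the teacher in as a callable keeps the pipeline independent of any particular model API, so a real client could be swapped in without changing the loop.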
The key limitation is illuminating: CFT-trained models can critique other models' outputs but do not develop self-critique capability. The training objective creates a competence asymmetry: the model gets better at evaluating others, not at evaluating itself. This is consistent with "Why do models trust their own generated answers?": the structural self-trust bias persists even after extensive critique training on others' outputs.
This connects to "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": both notes identify the same SFT failure mode, imitation of correct form without understanding. CFT addresses the root cause: instead of training on correct form, it trains on structured failure analysis.
Inquiring lines that use this note as a source 32
This note is a source for these synthesized inquiries.
- What makes deliberate practice on your own errors more effective than copying others?
- How does execution-guided critique differ from abstract action evaluation?
- How does critique fine-tuning on one problem unlock broader reasoning?
- Does training on critiques of noisy responses produce deeper understanding than imitating correct ones?
- Can models learn better from critiquing errors than imitating correct responses?
- Can AI-generated explanations of errors teach as effectively as self-resolution?
- What distinguishes genuine understanding from correct output without coherent principles?
- What happens when confident wrong answers become more rewarded than uncertain correct ones?
- What are collider structures and why do they reveal reasoning errors?
- Can high test performance mask a complete absence of understanding?
- Why do more detailed rating systems sometimes improve learning from reviews?
- Do negative reviewers actually appear more intelligent or competent than positive ones?
- Does reflection training actually teach models to self-correct their mistakes?
- How should training incorporate external critique versus encouraging self-correction?
- Why does polished presentation substitute for deeper expert judgment?
- Why does critique training produce deeper understanding than imitation training?
- Why does external critique improve revision accuracy more than self-assessment?
- Why does external critique improve revision while internal self-assessment fails?
- Why do models trained on critique fail at self-critique despite strong other-model evaluation?
- Can adversarial critics force genuine reasoning the same way critique fine-tuning does?
- How does training on correct answer form differ mechanistically from training on failure analysis?
- Does critique training improve exploration diversity during model training or only test time?
- What happens when students encounter errors they cannot resolve through prompting alone?
- What makes mathematically confident but incorrect answers resemble valid solution shapes?
- Why does evaluating errors teach more than imitating correct responses?
- Why does adversarial training force deeper reasoning than surface imitation?
- How does metacognitive self-correction enable models to revise failed strategies?
- Why do students learn better from explanations than from solving problems from scratch?
- Does external critique guide revision better than internal self-assessment during model training?
- Why do structure-targeted training negatives fail to fix the underlying problem?
- Why does self-critique fail without external verification signals?
- What makes some training data teach brittle answers versus robust reasoning?
Related concepts in this collection 7
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  SFT imitation is the failure; CFT is an alternative training objective that forces structural understanding over form imitation
- Why do models trust their own generated answers?
  Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.
  CFT's self-critique limitation confirms structural self-trust bias persists even when critique competence is developed for other-model evaluation
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  CFT is the counter-strategy: instead of training on correct answer form (which raises scores without understanding), CFT trains on structured failure analysis (which requires understanding)
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  complementary critique mechanism: AutoMathCritique uses critique to improve training-time exploration diversity; CFT uses critique-writing as the training signal itself; both treat critique as more than test-time quality filter
- Can adversarial critics replace task-specific verifiers for reasoning?
  Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
  parallel mechanism: RARO's adversarial critic forces genuine reasoning for the same reason CFT's critique objective does: discriminating expert from policy requires structural understanding, not surface pattern matching; both bypass pure imitation
- Can reasoning emerge from expert demonstrations alone?
  Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.
  RARO's co-trained critic operationalizes the critique principle via adversarial RL: the critic component develops evaluation capability through the same structural-understanding mechanism that makes CFT work, but in a joint training loop rather than a separate training objective
- Can reasoning improvement work without answer verification?
  Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.
  VeriFree extends critique-based training to domains without verifiers: where CFT trains on structured critique of flawed responses, VeriFree conditions on reference answer likelihood to create reward signal without explicit verification; both bypass the requirement for deterministic answer checking that limits standard RL to math/code
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
- Language Models Learn to Mislead Humans via RLHF
- Self-critiquing models for assisting human evaluators
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
Original note title
training to critique noisy responses produces deeper understanding than training to imitate correct responses