Does reflection training actually teach models to self-correct their mistakes?

This explores whether training models to 'reflect' or revisit their work genuinely produces error-correction — or just longer, more confident-sounding outputs that don't actually flip wrong answers to right ones.

The corpus leans hard toward 'mostly the appearance.' Across eight reasoning models, the most direct finding is that reflection is largely confirmatory rather than corrective: the extra reflective steps rarely change the initial answer, and training on longer reflection chains improves the quality of the *first* answer rather than the model's ability to catch and fix its own mistakes (see 'Is reflection in reasoning models actually fixing mistakes?', 'Does reflection in reasoning models actually correct errors?', and 'Can we actually trust reasoning model outputs?'). Much of what looks like deliberation is therefore post-hoc, so reliably that stopping reflection early saves 24.5% of the tokens for only a 2.9% accuracy loss.
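One way to make 'confirmatory rather than corrective' concrete is to compare each trace's answer before and after reflection against a reference answer. The sketch below is illustrative only: the `Trace` fields, the outcome names, and exact-match scoring are assumptions made for this example, not the methodology of the cited analysis.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Outcome labels are hypothetical names chosen for this sketch.
OUTCOMES = ("confirm_right", "confirm_wrong", "corrected", "broken", "changed_still_wrong")


@dataclass
class Trace:
    first_answer: str  # answer before any reflection (assumed to be extractable)
    final_answer: str  # answer after reflection
    gold: str          # reference answer, scored by exact match (an assumption)


def classify(trace: Trace) -> str:
    """Label what reflection did to the answer in a single trace."""
    first_ok = trace.first_answer == trace.gold
    final_ok = trace.final_answer == trace.gold
    if trace.first_answer == trace.final_answer:
        return "confirm_right" if first_ok else "confirm_wrong"
    if not first_ok and final_ok:
        return "corrected"  # wrong -> right: genuine self-correction
    if first_ok and not final_ok:
        return "broken"     # right -> wrong: reflection did harm
    return "changed_still_wrong"


def reflection_profile(traces: List[Trace]) -> Dict[str, float]:
    """Return the fraction of traces that fall into each outcome."""
    if not traces:
        return {name: 0.0 for name in OUTCOMES}
    counts = Counter(classify(t) for t in traces)
    return {name: counts[name] / len(traces) for name in OUTCOMES}
```

A profile dominated by the two `confirm_*` outcomes, with a small `corrected` share, is what 'reflection as confirmation' would look like under this framing.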

There's a structural reason the theater persists: models systematically over-trust answers they generated themselves, because their own high-probability outputs simply *feel* correct during self-evaluation (see 'Why do models trust their own generated answers?'). Reflection that only inspects its own first answer is therefore running in a closed loop: the bias that produced the error also blesses it. The fix is comparison against alternatives rather than re-reading the original. And when you actually decompose 'reflection' into measurable parts (making and revising assumptions, backtracking, and self-refinement under constraints), models trained on reasoning traces collapse precisely on the tasks that require genuine constraint-satisfying revision, which suggests current training buys surface fluency, not real correction (see 'What makes reflection actually work in reasoning models?').
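The text does not say how the comparison against alternatives should be done. As a stand-in, the sketch below uses plain majority voting over several independently sampled answers (often called self-consistency), so that no single draft gets to grade itself. The `sample` callable and the parameter `k` are hypothetical; this is one simple instance of comparing candidates, not the method of the cited note.

```python
from collections import Counter
from typing import Callable


def pick_by_comparison(sample: Callable[[str], str], question: str, k: int = 5) -> str:
    """Sample k candidate answers and return the most frequent one.

    `sample` is a hypothetical stand-in for one stochastic model call.
    Ties are broken in favor of the candidate that appeared first.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [sample(question) for _ in range(k)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```

The design point is that the selection signal comes from agreement across alternatives rather than from the model re-reading, and re-endorsing, one high-probability output.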

The more interesting turn is *what does* work, because the corpus isn't saying self-correction is impossible; it's saying you can't get it from imitation. Supervised fine-tuning on offline correction traces fails twice over: the errors in the training data don't match the errors the model makes at test time, and models collapse into a single canned correction move. What succeeds is multi-turn online reinforcement learning under the model's *own* error distribution, which lets it practice fixing the mistakes it actually makes, not transcripts of someone else's mistakes (see 'Why does self-correction training on offline data fail?'). The same lesson shows up from the opposite direction: pretraining on messy search traces that include exploration and backtracking reaches 25% higher accuracy than training only on clean optimal paths, because the model learns to navigate failure rather than recite success (see 'Does training on messy search processes improve reasoning?').
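The difference between offline traces and the model's own error distribution comes down to where the first attempt comes from. The sketch below shows only the data-collection half of a two-turn setup, with the RL update omitted. The `policy` signature, the `Episode` fields, and the choice to reward only the second attempt are assumptions made for illustration, not details taken from the cited work.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical policy interface: (problem, previous_attempt_or_None) -> answer.
Policy = Callable[[str, Optional[str]], str]


@dataclass
class Episode:
    problem: str
    first_attempt: str   # produced by the current policy, so its errors are on-policy
    second_attempt: str  # revision conditioned on the policy's own first attempt
    reward: float        # 1.0 if the revision matches the reference, else 0.0


def collect_episodes(policy: Policy, problems: List[Tuple[str, str]]) -> List[Episode]:
    """Roll out two-turn attempts with the current policy and score the revision."""
    episodes = []
    for problem, gold in problems:
        first = policy(problem, None)    # turn 1: the model's own attempt
        second = policy(problem, first)  # turn 2: revise that same attempt
        reward = 1.0 if second == gold else 0.0
        episodes.append(Episode(problem, first, second, reward))
    return episodes
```

An offline pipeline would replace `first` with attempts taken from a fixed dataset, which is where the mismatch between training-time and test-time errors enters.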

This points at a deeper principle running through several notes: engaging with failure beats imitating correctness. Training models to *critique* noisy responses produces deeper understanding than training on correct answers, because critique forces structural engagement with how reasoning breaks; even imperfect critique supervision beats clean-answer imitation (see 'Does critiquing errors teach deeper understanding than imitating correct answers?'). And models can internalize the judge role itself: post-completion learning trains self-assessment in the unused sequence space after the answer at zero inference cost (see 'Can models learn to evaluate their own work during training?'), while self-examining RL lets a model alternate between answering and ranking its own answers to improve without external rewards (see 'Can models learn to judge themselves without external rewards?').

The cautionary thread is that self-correction built on a model's own judgments can rot. When models train on their own outputs, small errors compound rapidly and stall improvement within a few iterations unless verification filters them; the ceiling is set by the quality of the check, not the model's raw capability (see 'How quickly do errors compound during model self-training?'). And some failures masquerading as reasoning errors are actually social: RLHF can teach models to agree with false claims to be accommodating, a face-saving reflex that no amount of reflection-as-confirmation will undo (see 'Why do language models agree with false claims they know are wrong?'). The takeaway you might not have gone looking for: reflection training as commonly practiced mostly makes the first answer better, and real self-correction is a different, harder thing, earned by practicing on your own mistakes and by learning to critique, not by adding more deliberation after the fact.
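To see why the quality of the check, rather than the model, sets the limit, a deliberately crude toy recurrence helps. It is not taken from the cited note: the multiplicative `amplification` factor, the `verifier_recall` parameter, and all numbers are invented for illustration. Each round, the verifier removes a fraction of the erroneous samples and training on the survivors multiplies what remains.

```python
from typing import List


def simulate_error_rate(
    initial_error: float,
    amplification: float,
    verifier_recall: float,
    iterations: int,
) -> List[float]:
    """Toy model: e_next = min(1, amplification * e * (1 - verifier_recall)).

    All parameters are hypothetical knobs for illustration, not measured values.
    """
    rates = [initial_error]
    for _ in range(iterations):
        surviving = rates[-1] * (1.0 - verifier_recall)  # errors the check missed
        rates.append(min(1.0, amplification * surviving))
    return rates
```

In this toy, errors grow whenever `amplification * (1 - verifier_recall)` exceeds 1 and shrink when it falls below 1, so the same model either degrades or improves depending only on how much the verifier catches.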

Sources (12 notes)

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Does reflection in reasoning models actually correct errors?

Analysis of 8 reasoning models shows reflections rarely change initial answers. Training on more reflection steps improves first-attempt correctness, not error-correction ability. Early stopping saves 24.5% tokens with only 2.9% accuracy loss.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

What makes reflection actually work in reasoning models?

LR²Bench decomposes reflection into three measurable capabilities: assumptions, backtracking, and self-refinement. Models trained on reasoning traces collapse at tasks requiring actual constraint-satisfying revision, suggesting current reflection training improves surface fluency, not genuine correction.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

How quickly do errors compound during model self-training?

Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
