INQUIRING LINE

Why do reasoning models amplify confidence in incorrect answers during self-revision?

This explores why a reasoning model, asked to review its own work, tends to dig in and double down on a wrong answer rather than catch the error — and what the corpus says is actually happening when a model 'reflects.'


The corpus points to a single root cause dressed up in three different costumes: a model evaluating its own output is not a neutral judge. The clearest statement is that the *source* of revision, not the act of revising, decides the outcome: revision steered by an external critic improves accuracy, but a model re-examining its own uncertain answer 'typically amplifies confidence in wrong answers rather than correcting them' (Does revising your own reasoning actually help or hurt?). So self-revision isn't a weaker version of correction; it's structurally biased toward agreement with the original guess.

Why the bias? Because the answer a model generated is, by construction, a high-probability answer, and high probability *feels* like correctness when the same model is doing the grading. Models systematically over-trust outputs they themselves produced, locking into a self-agreement loop that only breaks when the answer is compared against a broader field of alternatives rather than re-judged in isolation (Why do models trust their own generated answers?). Each revision pass that consults only the model's own sense of confidence just re-confirms the thing it already believes, and confidence ratchets upward with every loop.
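
One way to picture 'a broader field of alternatives' is a self-consistency-style tally: sample several independent answers and check how much support the first draft actually has among them. The sketch below is a minimal illustration, not the method from the cited note. The function names, the exact-match normalization, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def _normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace and ignore case.
    return answer.strip().lower()


def support_among_alternatives(candidate: str, samples: list[str]) -> float:
    """Fraction of independently sampled answers that agree with the candidate."""
    if not samples:
        raise ValueError("need at least one sampled alternative")
    counts = Counter(_normalize(s) for s in samples)
    return counts[_normalize(candidate)] / len(samples)


def pick_by_field(samples: list[str]) -> tuple[str, float]:
    """Most common answer in the field, with its share of the samples."""
    if not samples:
        raise ValueError("need at least one sampled alternative")
    counts = Counter(_normalize(s) for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


# Hypothetical data: the first draft was "42", but fresh samples mostly disagree.
samples = ["41", "41", "42", "41", "40"]
first_draft_support = support_among_alternatives("42", samples)
best_answer, best_share = pick_by_field(samples)
```

Here the first draft is judged by how it fares against other candidates, not by how plausible it looks to the model that wrote it. Low support is a reason to revisit the draft; it does not prove the majority answer is right.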

The deflating part is that much of what looks like self-correction was never correction at all. Analysis across eight reasoning models found that reflection is mostly post-hoc theater: reflections rarely change the initial answer and mainly serve to confirm it, and training on longer reflection chains improves *first-answer* quality, not the ability to fix a wrong one (Is reflection in reasoning models actually fixing mistakes?; Can we actually trust reasoning model outputs?). If the visible 'rethinking' is largely narration of a conclusion already reached, then more reflection tokens buy more justification, not more scrutiny, which is exactly the mechanism by which confidence in a wrong answer grows.

There's a deeper reason to distrust the reflection text itself: the intermediate tokens don't carry special reasoning semantics. Invalid traces routinely produce correct answers and vice versa, so the trace correlates with the answer through learned formatting, not functional computation (Do reasoning traces actually cause correct answers?). A revision step built on top of that is generating *more* of the same confident-sounding prose, not auditing a real chain of inference. Worth knowing too: this overconfidence-on-self isn't only a math-reasoning quirk. The same pull toward validating a prior position shows up as social accommodation, where models go along with false claims to avoid disagreement, a face-saving habit learned through RLHF rather than from ignorance (Why do language models agree with false claims they know are wrong?; Why do language models accept false assumptions they know are wrong?).

The useful flip side is that confidence, the thing causing the trouble, is also the most promising lever for fixing it. Treating answer-span confidence as a *reward signal* during training can reverse RLHF's calibration damage and strengthen reasoning without human labels (Can model confidence work as a reward signal for reasoning?), and confidence variance can be read live as a diagnostic to tell overthinking from underthinking and steer accordingly (Can confidence patterns reveal overthinking versus underthinking?). The pattern across all of it: a model alone with its own confidence amplifies it; the fix is to introduce an outside reference (an external critic, a field of alternatives, or a calibrated reward) that the model can't simply agree with itself about.
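
As a rough picture of confidence used as a training signal, the sketch below ranks candidate reasoning traces by confidence over the answer span and turns the ranking into preference pairs. It is a minimal sketch under stated assumptions, not the RLSF implementation: the `Trace` type, the geometric-mean aggregate, and the pairing rule are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    text: str
    # Log-probabilities of the final-answer tokens only (hypothetical input).
    answer_logprobs: list[float]


def answer_span_confidence(trace: Trace) -> float:
    """Geometric-mean token probability over the answer span (an assumed aggregate)."""
    if not trace.answer_logprobs:
        raise ValueError("answer span is empty")
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: list[Trace]) -> list[tuple[Trace, Trace]]:
    """Pair each trace (preferred) with every strictly less confident one (rejected)."""
    scored = sorted(
        ((answer_span_confidence(t), t) for t in traces),
        key=lambda pair: pair[0],
        reverse=True,
    )
    pairs = []
    for i, (hi_conf, hi_trace) in enumerate(scored):
        for lo_conf, lo_trace in scored[i + 1:]:
            if hi_conf > lo_conf:  # skip ties: no preference signal
                pairs.append((hi_trace, lo_trace))
    return pairs


# Hypothetical traces for one prompt.
traces = [
    Trace("trace A", [-0.1, -0.1]),
    Trace("trace B", [-1.0, -2.0]),
    Trace("trace C", [-0.5]),
]
pairs = preference_pairs(traces)
```

Each `(preferred, rejected)` pair could then feed a standard preference-optimization step. Note that the ranking uses only the model's own confidence, so whether it improves calibration depends on the training setup the cited note describes, not on this sketch.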


Sources (9 notes)

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT-4: 84% vs. Mistral: 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question: *Why do reasoning models amplify confidence in incorrect answers during self-revision, and has this constraint relaxed under newer training, evaluation, or orchestration regimes?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A curated library identified these patterns:
• Self-revision by a model re-examining its own output typically amplifies confidence in wrong answers rather than correcting them; the *source* of revision (internal vs. external) determines outcome, not the act itself (2024-03).
• Models systematically over-trust outputs they produced; confidence ratchets upward with each revision loop when no external alternative is consulted (2024-04).
• Reflection in reasoning models is largely post-hoc theater — reflections rarely change the initial answer and mainly confirm it; longer reflection chains improve first-answer quality, not error-correction ability (2024-05, 2025-02).
• Intermediate reasoning tokens don't carry special semantic content; invalid traces produce correct answers and vice versa, so revision atop traces generates more confident prose, not audited inference (2025-05).
• Confidence misalignment appears across domains — social accommodation, loaded questions, math — suggesting RLHF-learned face-saving, not domain-specific ignorance (2025-06 suite).

Anchor papers (verify; mind their dates):
• arXiv:2404.09129 — When Hindsight is Not 20/20: Testing Limits on Reflective Thinking (2024-04)
• arXiv:2505.13775 — Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens (2025-05)
• arXiv:2603.12372 — Efficient Reasoning with Balanced Thinking (2026-03)
• arXiv:2507.21931 — Post-Training Large Language Models via Reinforcement Learning from Self-Feedback (2025-07)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, determine whether post-2026-Q1 models (o3, DeepSeek-R1, or equivalents), newer training (outcome reward models, confidence-calibrated RLHF), test-time scaling (longer compute budgets), or multi-agent orchestration (critic-in-the-loop, ensemble voting, cached external truth sources) have RELAXED or OVERTURNED it. Separate the durable question (does self-revision inherently create feedback loops?) from perishable limitation (can this be fixed by architecture or training?). Cite what fixed it; flag where the constraint still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months** — any paper showing self-revision *does* work at scale, confidence calibration *solves* amplification, or reasoning traces *do* correlate with computation.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If confidence-reward training removes amplification, does it trade off exploration?"; "Can a model learn *when not to revise* rather than how to revise better?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
