SYNTHESIS NOTE

Does self-revision actually improve reasoning in language models?

When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Self-revision in o1-like models — prompted by tokens like "Wait" or "Alternatively" — does not reliably fix errors. The evidence from QwQ, R1, and LIMO shows:

  1. Most revisions retain the original (wrong) answer rather than correcting it
  2. Smaller models (R1-Distill-1.5B, QwQ) show a higher propensity to revise correct answers to incorrect ones than vice versa
  3. Longer CoTs contain more self-revisions, which helps explain why longer traces correlate with incorrectness
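
The three observations above all reduce to counting what each revision did to the answer. A minimal sketch of that tally follows, assuming a hypothetical record format of one (correct_before, correct_after) pair per self-revision; the function names and the toy data are invented for illustration and are not taken from the cited studies.

```python
from collections import Counter

# Hypothetical labels for the four possible outcomes of one self-revision.
LABELS = {
    (False, False): "wrong_kept_wrong",
    (False, True): "wrong_to_correct",
    (True, False): "correct_to_wrong",
    (True, True): "correct_kept_correct",
}


def tally_revisions(events):
    """Count revision outcomes.

    `events` is an iterable of (correct_before, correct_after) booleans,
    one pair per self-revision. This record format is an assumption.
    """
    return Counter(LABELS[(bool(before), bool(after))] for before, after in events)


def net_fixes(counts):
    """Fixes minus breakages; negative means revision hurt more than it helped."""
    return counts["wrong_to_correct"] - counts["correct_to_wrong"]


# Toy data, invented purely to exercise the functions.
toy_events = [
    (False, False),
    (False, False),
    (True, False),
    (False, True),
    (True, False),
    (True, True),
]
counts = tally_revisions(toy_events)
print(dict(counts), "net fixes:", net_fixes(counts))
```

This tally does not distinguish a wrong answer that stays the same from one that changes to a different wrong answer; separating those would need the answers themselves, not just correctness flags.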

The irony is that self-revision is framed as a feature — the model reflecting on its own reasoning. But empirically, the reflection is often noise that introduces additional errors rather than catching existing ones. The model's capacity to evaluate its own correctness is limited, so its "reflection" is more likely to perturb a right answer than to save a wrong one.
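
The "perturb versus save" argument can be written as a one-line accuracy balance. The symbols below are assumptions introduced here for illustration: a is accuracy before revision, p_fix is the probability that a revision turns a wrong answer into a correct one, and p_break is the probability that it turns a correct answer into a wrong one.

```latex
\begin{align*}
a' &= a\,(1 - p_{\text{break}}) + (1 - a)\,p_{\text{fix}} \\
\Delta &= a' - a = (1 - a)\,p_{\text{fix}} - a\,p_{\text{break}} \\
\Delta > 0 &\iff \frac{p_{\text{fix}}}{p_{\text{break}}} > \frac{a}{1 - a}
\end{align*}
```

Under this simplified model, a model that is already right 80% of the time gains from revision only if it fixes wrong answers more than four times as readily as it breaks correct ones. The higher the starting accuracy, the stricter that bar becomes.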

This has implications for inference strategy: forcing models to self-revise (by suppressing the </think> token and appending "Wait") is more likely to degrade a good answer than improve a bad one. The better alternative is parallel reasoning; see "Why does parallel reasoning outperform single chain thinking?".
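
To make the two strategies concrete, here is a minimal sketch of forced self-revision next to a simple parallel alternative. It assumes a hypothetical `generate` callable that takes the text so far and returns a continuation ending in the `</think>` marker; no real inference API is implied, and majority voting is used here only as one plain form of parallel reasoning.

```python
from collections import Counter

END_THINK = "</think>"


def force_revisions(generate, prompt, extra_rounds=1, nudge="Wait"):
    """Forced self-revision: drop the end-of-thinking marker, append a nudge, continue.

    `generate` is a hypothetical callable (text -> continuation string).
    """
    trace = prompt + generate(prompt)
    for _ in range(extra_rounds):
        if trace.endswith(END_THINK):
            # Suppress the marker so the model keeps reasoning.
            trace = trace[: -len(END_THINK)]
        trace += nudge
        trace += generate(trace)
    return trace


def majority_vote(answers):
    """Parallel alternative: keep the most common answer across independent chains."""
    return Counter(answers).most_common(1)[0][0]


def stub_generate(text):
    """Stand-in for a model call, used only to exercise the loop."""
    return " step" + END_THINK


forced = force_revisions(stub_generate, "Q:", extra_rounds=2)
voted = majority_vote(["4", "5", "4"])
```

The forced loop spends extra compute on one chain that can overwrite its own answer, while the vote spends it on independent chains that cannot perturb each other.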

The Degeneration-of-Thought finding (ReConcile) adds the mechanism: when a model is challenged by its own previous reasoning reframed as external criticism, it neither maintains its position nor improves. It capitulates with increasing confidence and ends more certain of the wrong answer than it started. This is the acute form: self-revision at the token level degrades accuracy; self-revision at the model-vs-model level collapses calibration. The difference between diverse multi-agent debate (which helps) and same-model challenge (which harms) suggests the key variable is not revision depth but the source of challenge. The note "Does a model improve by arguing with itself?" documents this contrastive finding.

Original note title

self-revision degrades reasoning accuracy in o1-like models