Why does most refinement in iterative models maintain answers rather than improve them?

This explores why an AI that revises its own answer over multiple passes usually ends up restating the same answer instead of making it better — and what the corpus says is missing.

The corpus converges on a single root cause: a model can generate a revision, but it can't reliably tell whether the revision is actually better. Self-improvement is formally bounded by what researchers call the generation-verification gap: every reliable fix requires something outside the model to validate and enforce it, and no amount of metacognition lets the model escape that ceiling on its own [What stops large language models from improving themselves?]. Without an external check, 'refine' collapses into 'rephrase.'
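
To make the gap concrete, here is a minimal sketch under stated assumptions: `propose` stands in for the model's revision step and `external_score` for a grader that lives outside the model. Both names, and the toy stand-ins below them, are hypothetical and not taken from the corpus. The loop keeps a revision only when the outside check scores it strictly higher.

```python
from typing import Callable

def refine(answer: str,
           propose: Callable[[str], str],
           external_score: Callable[[str], float],
           passes: int = 3) -> str:
    """Keep a revision only if an external grader scores it strictly higher."""
    best, best_score = answer, external_score(answer)
    for _ in range(passes):
        candidate = propose(best)
        score = external_score(candidate)
        if score > best_score:  # the only place "better" is decided
            best, best_score = candidate, score
    return best

# Toy stand-ins (assumptions for illustration only).
def rephrase(text: str) -> str:
    """A 'revision' that changes wording but adds no new content."""
    return text + " (restated)"

def count_digits(text: str) -> float:
    """A deliberately crude external grader: counts digit characters."""
    return float(sum(ch.isdigit() for ch in text))

result = refine("x = 42", rephrase, count_digits)
```

With these stand-ins every pass produces a rephrasing that the grader scores no higher, so the original answer is returned unchanged. Remove the `if` gate and the loop would return whatever the last rephrasing happened to be.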

Underneath that, several notes suggest the model isn't really doing the iterative work we imagine it's doing. When asked to run iterative numerical methods, LLMs don't actually execute the procedure step by step; they recognize a problem as template-similar to something memorized and emit a plausible-looking value, a failure that persists across model scale [Do large language models actually perform iterative optimization?]. Extended chains of thought make this worse in a telling way: reasoning variants produce *more text* on constraint-bound numerical tasks without producing *more computation*, and so don't systematically beat plain models [Do reasoning models actually beat standard models on optimization?]. So a revision pass adds words around the same memorized guess rather than recomputing toward a better one.
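
For contrast, this is roughly what "executing the procedure step by step" means. The sketch uses Newton's iteration for a square root purely as a familiar example; it is an assumption chosen for illustration, not the method studied in the cited notes. Each pass recomputes from the current state, and the residual gives a measurable signal of progress.

```python
def newton_sqrt(a: float, x0: float = 1.0, tol: float = 1e-12, max_iter: int = 50):
    """Newton's iteration for x*x = a; returns (estimate, residual per pass)."""
    x = x0
    residuals = []
    for _ in range(max_iter):
        residual = abs(x * x - a)
        residuals.append(residual)
        if residual < tol:
            break
        x = x - (x * x - a) / (2 * x)  # recompute from the current state
    return x, residuals

estimate, residuals = newton_sqrt(2.0)
```

A model that emits a memorized value skips all of this: there is no intermediate state to update and no residual to shrink, so a second pass has nothing to recompute.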

When models do change something, it's often the surface. On optimization problems, supervised fine-tuning teaches outputs to *look* correct (clean JSON, valid identifiers, expected sections) without making them physically feasible, because the model learns the surface features of good solutions rather than the reasoning to construct them [Does supervised fine-tuning actually improve reasoning on optimization problems?]. That's the mechanism of 'maintain, don't improve' in miniature: the refinement edits the packaging, not the substance. And there's a deeper version: a model can hold all the linearly-decodable features a task needs while its internal organization is fractured, so its answer can be 'right' on the metric yet brittle, with nothing structured inside to revise toward [Can models be smart without organized internal structure?].
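
The difference between packaging and substance can be shown with two separate checks. This is a toy sketch: the JSON fields, generator limits, and demand figure are all invented for illustration, and the power-balance test is a drastic simplification rather than a real optimal power flow formulation.

```python
import json

# Hypothetical model output: well-formed, expected keys, plausible numbers.
raw = '{"generators": [{"id": "G1", "mw": 80.0}, {"id": "G2", "mw": 50.0}]}'

LIMITS_MW = {"G1": (10.0, 100.0), "G2": (10.0, 40.0)}  # assumed (min, max) per unit
DEMAND_MW = 120.0                                       # assumed load to be met

def looks_correct(text: str) -> bool:
    """Surface check: parses as JSON and has the expected fields."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    gens = data.get("generators")
    return isinstance(gens, list) and all(
        isinstance(g, dict) and {"id", "mw"} <= set(g) for g in gens
    )

def is_feasible(text: str, tol: float = 1e-6) -> bool:
    """Substance check: every unit within limits and supply equals demand."""
    gens = json.loads(text)["generators"]
    within_limits = all(
        LIMITS_MW[g["id"]][0] <= g["mw"] <= LIMITS_MW[g["id"]][1] for g in gens
    )
    balanced = abs(sum(g["mw"] for g in gens) - DEMAND_MW) <= tol
    return within_limits and balanced

format_ok = looks_correct(raw)
feasible = is_feasible(raw)
```

Here the output passes the surface check and fails the feasibility check: G2 exceeds its assumed limit and total supply overshoots demand. A refinement pass that only ever sees the first check has no reason to change anything.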

There's also a noise problem. Sequential revision reproduces the same failure as token-level overthinking: it accumulates noise across iterations with no guarantee of improvement, just at a slower tempo [Do iterative refinement methods suffer from overthinking?]. Reasoning models compound this by wandering and switching paths prematurely, abandoning promising directions rather than carrying them forward [Why do reasoning models abandon promising solution paths?]. So even when a better answer is reachable, the refinement loop is as likely to drift away from it as toward it.
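
One way to picture the drift is a toy simulation. It assumes, purely for illustration, that each revision perturbs a hidden quality score by zero-mean noise; that is a modeling assumption, not a measurement from the cited work. The same noise sequence is fed to a loop that accepts every revision and to one that keeps a revision only when a grader confirms it is better.

```python
import random

def revise_without_grader(quality: float, noise: list[float]) -> list[float]:
    """Accept every revision: quality performs a random walk."""
    trajectory = [quality]
    for n in noise:
        quality += n
        trajectory.append(quality)
    return trajectory

def revise_with_grader(quality: float, noise: list[float]) -> list[float]:
    """Accept a revision only if the grader says it is strictly better."""
    trajectory = [quality]
    for n in noise:
        candidate = quality + n
        if candidate > quality:
            quality = candidate
        trajectory.append(quality)
    return trajectory

rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(20)]
ungated = revise_without_grader(0.0, noise)
gated = revise_with_grader(0.0, noise)
```

The gated trajectory can never go down, while the ungated one has no such guarantee. The catch is that the grader here reads true quality directly, which is exactly the capability the model lacks on its own.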

The interesting turn is what *does* break the stalemate, and it's the same thing in every case: an external signal. The Darwin Gödel Machine gets real, open-ended improvement precisely by replacing introspection with empirical benchmarking and keeping an archive of variants; it improves because reality grades each attempt [Can AI systems improve themselves through trial and error?]. The ACE framework gets gains by treating context as an evolving playbook with structured incremental updates instead of full rewrites, which stops each iteration from erasing what the last one learned [Can context playbooks prevent knowledge loss during iteration?]. The throughline worth taking away: refinement maintains rather than improves whenever the loop has no grader outside itself. Give it an empirical test, a preserved memory, or a verifier, and 'revise' starts to mean something.
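
The "benchmark plus archive" idea can be sketched generically. This is not the Darwin Gödel Machine's actual algorithm, whose details the notes here don't spell out; it is a minimal archive loop under assumptions, with `mutate`, `benchmark`, and the numeric toy problem all hypothetical. Every variant is scored by measurement, and every variant is kept so that later rounds can branch from any of them.

```python
import random
from typing import Callable

def evolve(seed_variant: float,
           mutate: Callable[[float, random.Random], float],
           benchmark: Callable[[float], float],
           generations: int = 30,
           seed: int = 0) -> list[tuple[float, float]]:
    """Grow an archive of (score, variant) pairs graded by an external benchmark."""
    rng = random.Random(seed)
    archive = [(benchmark(seed_variant), seed_variant)]
    for _ in range(generations):
        _, parent = rng.choice(archive)            # branch from any archived variant
        child = mutate(parent, rng)
        archive.append((benchmark(child), child))  # measured, never self-assessed
    return archive

# Toy problem (assumption): a variant is a number, the benchmark rewards being near 3.
def toy_benchmark(x: float) -> float:
    return -(x - 3.0) ** 2

def toy_mutate(parent: float, rng: random.Random) -> float:
    return parent + rng.gauss(0.0, 0.5)

archive = evolve(0.0, toy_mutate, toy_benchmark)
best_score, best_variant = max(archive)
```

Because the seed variant stays in the archive, the best measured score can never fall below where the loop started, and nothing learned in an earlier round is overwritten by a later one.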

Sources 9 notes

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do iterative refinement methods suffer from overthinking?

Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can AI systems improve themselves through trial and error?

The Darwin Gödel Machine (DGM) replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher testing whether iterative self-refinement in LLMs truly maintains answers or whether recent advances (models, training methods, multi-agent orchestration, external verifiers) have begun to close the generation-verification gap. The question: *Can LLMs escape surface-level revision when given no external grader?*

What a curated library found — and when (dated claims, not current truth):
Findings span Oct 2023–Mar 2026. A library of recent work converges on these constraints:

• The generation-verification gap is fundamental: models can revise but cannot reliably judge whether a revision is better without external validation (arXiv:2412.02674, ~2024-12).
• Extended reasoning (chain-of-thought variants) produces *more text* without proportionally more computation on constraint-bound tasks, so refinement adds words around memorized answers rather than recomputing (arXiv:2505.20296, ~2025-05).
• Reasoning models wander and switch paths prematurely, abandoning promising directions instead of systematically improving them (arXiv:2505.20296, ~2025-05).
• When models do revise, they often edit surface features—formatting, JSON validity—without improving physical feasibility or underlying reasoning (SFT observations, ~2024-25).
• Open-ended improvement emerges only when loops include external signals: empirical benchmarking (Darwin Gödel Machine, arXiv:2505.22954, ~2025-05), preserved memory (Agentic Context Engineering, arXiv:2510.04618, ~2025-10), or structured incremental updates.

Anchor papers (verify; mind their dates):
- arXiv:2412.02674 (Mind the Gap, 2024-12): self-improvement ceiling
- arXiv:2505.22954 (Darwin Gödel Machine, 2025-05): external grading unlocks improvement
- arXiv:2510.04618 (Agentic Context Engineering, 2025-10): memory structures enable revision
- arXiv:2505.20296 (Reasoning LLMs are Wandering, 2025-05): path abandonment failure

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models (o1-style reasoning chains, 2025–2026 deployments), training advances (reinforcement learning from step-wise reasoning trajectories, arXiv:2510.25992), or orchestration (multi-agent memory, persistent context, external verifiers embedded in sampling loops) have since relaxed or overturned it. Separate the durable question (*Can a model revise without external signal?*) from the perishable limitation (*Current models cannot*). Flag what—if anything—has resolved a constraint.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months (roughly Nov 2025–Apr 2026). Does any recent paper show genuine end-to-end improvement without external grading?
(3) Propose 2 research questions that assume the regime *has* shifted: e.g., "If Foundation Priors (arXiv:2512.01107) allow models to refactor internal representations mid-inference, can they detect and repair their own reasoning drift?" or "Does multi-query retrieval (RAG-R1, arXiv:2507.02962) combined with step-wise RL reintroduce introspective improvement?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
