How should systems maintain and revise models of their own assumptions?

This explores whether AI systems can reliably track and update their own beliefs and assumptions — and the corpus is fairly blunt that pure self-revision, left to its own devices, tends to entrench assumptions rather than correct them.

The surprising answer running through the collection is that systems are structurally bad at maintaining and revising models of their own assumptions when left alone, because they trust their own outputs too much. The cleanest statement of the problem is the generation-verification gap: a model can generate a fix, but it has no internal way to verify that it is actually a fix, so reliable revision always needs an external anchor (see "What stops large language models from improving themselves?"). The same point shows up as the "self-improvement mirage": methods that look like pure self-improvement are quietly smuggling in outside signals such as past model versions, third-party judges, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). So the first lesson isn't about better introspection; it's that revising your assumptions requires something you can't generate from inside your own head.
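The external-anchor idea can be sketched in a few lines. This is only an illustration under stated assumptions: `propose` is a hypothetical stand-in for the model suggesting a revision, and `verify` is a hypothetical stand-in for an outside check (a test suite, a tool, a separate judge). Neither name comes from the cited notes; the point is simply that acceptance is decided outside the generator.

```python
from typing import Callable, Optional


def revise_with_external_anchor(
    draft: str,
    propose: Callable[[str], str],  # hypothetical: the model proposing a revision
    verify: Callable[[str], bool],  # hypothetical: an external check, not the model itself
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate the external check accepts, else None."""
    candidate = draft
    for _ in range(max_rounds):
        if verify(candidate):
            return candidate
        # The generator only proposes; it never decides whether it succeeded.
        candidate = propose(candidate)
    # Check the last proposal too, and report failure instead of guessing.
    return candidate if verify(candidate) else None


# Toy usage: the "model" appends a character; the verifier wants length 3.
result = revise_with_external_anchor("a", lambda s: s + "a", lambda s: len(s) == 3)
print(result)
```

Returning `None` when nothing passes is a deliberate choice in this sketch: an unverified revision is reported as unverified rather than quietly accepted.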

Why is internal revision so unreliable? Because models have a built-in bias toward trusting what they themselves produced: a high-probability generated answer simply *feels* correct when the same model goes to evaluate it, creating a self-agreement loop (see "Why do models trust their own generated answers?"). When a model revisits its own uncertain answer, it usually grows *more* confident in the wrong answer rather than fixing it, a failure mode called "degeneration of thought" (see "Does a model improve by arguing with itself?"). This is measurable: across o1-like models (QwQ, R1, and LIMO), most revisions keep the wrong answer, and longer chains with more revision steps correlate with *lower* accuracy, not higher (see "Does self-revision actually improve reasoning in language models?"). The decisive variable turns out to be the *source* of the critique: revision guided by an external critic improves accuracy, while a model revising its own uncertain output tends to amplify confidence in wrong answers (see "Does revising your own reasoning actually help or hurt?").

This reframes what "reflection" actually does. Several studies find that the visible reflection in reasoning models is mostly confirmatory theater: reflections rarely change the initial answer, and the reasoning traces don't faithfully represent what drove the output (see "Can we actually trust reasoning model outputs?" and "Is reflection in reasoning models actually fixing mistakes?"). Training on longer reflection chains improves the *first* answer's quality but not the model's ability to catch and repair its own mistakes. And when genuine assumption-revision is tested directly, using constraint satisfaction problems that demand real backtracking, frontier models (DeepSeek-R1 and o1-preview) stall at 20-23.6% exact match, revealing that reflective fluency doesn't translate into actually revising a model of the problem when the structure is unfamiliar (see "Can reasoning models actually sustain long-chain reflection?").

There's also a quieter, compounding danger: assumptions don't just fail to get revised, they self-reinforce through context. When a model's own earlier errors fill its context window, those errors bias later reasoning and degrade performance non-linearly. Scaling the model doesn't fix it; only test-time compute that prevents the contaminated context from steering reasoning helps (see "Do models fail worse when their own errors fill the context?"). So a system that keeps its own past outputs around is effectively feeding its stale assumptions back to itself.

The constructive thread across all of this points in one direction: assumption-revision works when it's *adversarial and external*, not introspective. Multi-agent debate with genuinely different models reverses the confidence-in-error spiral and improves both accuracy and calibration (see "Does a model improve by arguing with itself?"), and comparing an answer against broader alternatives breaks the self-agreement loop (see "Why do models trust their own generated answers?"). Worth knowing too: humans are part of this loop whether we intend it or not. Iterative prompting steers a model toward the user's *own* priors, so the "assumptions" being maintained are often co-produced by the person, not the system alone (see "How much does the user shape what a model generates?"). And at the deepest level, some assumptions are baked into the architecture itself, such as the premise that language is a complete, stable thing extractable from text, which no amount of runtime self-revision can touch (see "What hidden assumptions drive how we build language models?"). The takeaway: a system can't bootstrap its way to honest self-doubt; it has to be confronted from outside.
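As a rough illustration of "comparing an answer against broader alternatives", the sketch below accepts an answer only when enough independent answerers agree, and otherwise defers. It is a deliberately simpler stand-in for the multi-agent debate described in the notes, not a reproduction of it. The function name `cross_check`, the `answerers` callables, and the `min_agreement` threshold are all hypothetical, and exact string matching of answers is an assumption made for brevity.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def cross_check(
    question: str,
    answerers: List[Callable[[str], str]],  # hypothetical: genuinely different models
    min_agreement: float = 0.6,             # hypothetical threshold, not from the notes
) -> Tuple[Optional[str], float]:
    """Return (answer, agreement share); answer is None when agreement is too low."""
    if not answerers:
        raise ValueError("need at least one answerer")
    answers = [ask(question) for ask in answerers]
    top_answer, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    # Low agreement is treated as a signal to escalate, not as a tie to break.
    return (top_answer if share >= min_agreement else None), share


# Toy usage with three stand-in "models".
models = [lambda q: "4", lambda q: "4", lambda q: "5"]
answer, share = cross_check("2 + 2?", models)
print(answer, round(share, 2))
```

The check is only as good as the independence of the answerers: several copies of one model would mostly reproduce the self-agreement loop the notes warn about.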

Sources (12 notes)

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

How much does the user shape what a model generates?

Foundation Priors research shows prompt engineering as divergence minimization between synthetic output and user priors. The refinement process systematically steers generation toward what users already expect, making outputs co-productions of model and user subjectivity.

What hidden assumptions drive how we build language models?

LLMs assume language is a complete stable thing extractable from text data. Enactive linguistics rejects both: language is a practice requiring embodied participation, and no dataset can capture its radical incompleteness and responsiveness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM self-assumption-revision. The precise question: can systems maintain and revise their own assumptions reliably without external anchors?

What a curated library found — and when (dated claims, not current truth):
Library findings span 2023–2026. Key constraints identified:
- Generation-verification gap: models can generate fixes but cannot internally verify them as correct; reliable revision requires external signal (2024–2025).
- Self-agreement loop: models show inherent bias toward trusting their own outputs; internal self-assessment degrades reasoning accuracy rather than improving it (~2024).
- Degeneration of thought: single-model self-revision correlates with *lower* accuracy; most revisions keep wrong answers; longer chains worsen outcomes (2024–2025).
- Reflection-as-theater: visible reasoning traces in o1-like models rarely change initial answers; reflective fluency doesn't translate to genuine assumption-repair; frontier models stall at 20–23% on constraint-satisfaction tasks demanding backtracking (2024–2025).
- Self-conditioning effect: prior errors in the context window amplify future errors non-linearly; scaling doesn't help; only test-time compute that keeps contaminated context from steering reasoning mitigates it (2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2404.09129 (Apr 2024): When Hindsight is Not 20/20 — tests reflective limits.
- arXiv:2412.02674 (Dec 2024): Mind the Gap — self-improvement capabilities under scrutiny.
- arXiv:2502.12215 (Feb 2025): Revisiting Test-Time Scaling of o1-like Models — evaluates whether test-time compute truly enables revision.
- arXiv:2603.19461 (Mar 2026): Hyperagents — multi-agent and adversarial setups.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer models (o3, o4, or o1-mini variants), training innovations (process-supervised RL, outcome-supervised reasoning), tool APIs (persistent memory, structured revision harnesses), multi-agent orchestration (adversarial debate frameworks, council-of-experts), or evaluation benchmarks (new constraint-satisfaction or assumption-audit suites) have since RELAXED or OVERTURNED it. Plainly separate the durable question (still open: how *should* systems revise assumptions?) from perishable limitations (possibly solved by external critique, multi-agent debate, or architecture shift). Cite what solved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that challenges the "internal revision is structurally broken" claim — especially any showing single-model self-revision does scale or any evidence that reflection genuinely changes outputs.
(3) Propose 2 research questions that ASSUME the revision regime may have fundamentally moved — e.g., "If process-supervision enables o3-class models to revise assumptions without external judges, what *are* the residual limits?" or "Can adversarial multi-agent setups be baked into training, eliminating the need for runtime debate?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
