How should systems maintain and revise models of their own assumptions?
This explores whether AI systems can reliably track and update their own beliefs and assumptions — and the corpus is fairly blunt that pure self-revision, left to its own devices, tends to entrench assumptions rather than correct them.
The surprising answer running through the collection is that systems are structurally bad at revising models of their own assumptions when left alone, because they trust their own outputs too much. The cleanest statement of the problem is the generation-verification gap: a model can generate a fix, but it has no internal way to verify that it is actually a fix, so reliable revision always needs an external anchor (What stops large language models from improving themselves?). The same point shows up as the "self-improvement mirage": methods that look like pure self-improvement are quietly smuggling in outside signals such as past model versions, third-party judges, user corrections, or tool feedback (Can models reliably improve themselves without external feedback?). So the first lesson isn't about better introspection; it's that revising your assumptions requires something you can't generate from inside your own head.
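A minimal sketch of what an external anchor can look like, assuming the anchor is a set of input/output cases held outside the generator. The names `external_check` and `revise_with_anchor` are hypothetical, and the "model" is a toy stand-in: a list of candidate fixes for a doubling function. It is an illustration of the idea, not a method taken from the cited notes.

```python
from typing import Callable, Iterable, Optional

Candidate = Callable[[int], int]
Cases = list[tuple[int, int]]


def external_check(candidate: Candidate, cases: Cases) -> bool:
    """Verification the generator does not produce itself: held-out cases."""
    return all(candidate(x) == expected for x, expected in cases)


def revise_with_anchor(proposals: Iterable[Candidate], cases: Cases) -> Optional[Candidate]:
    """Accept a proposed revision only if the external check passes."""
    for candidate in proposals:
        if external_check(candidate, cases):
            return candidate
    # Nothing verified: report failure instead of trusting self-assessment.
    return None


# Toy stand-in for a model's successive "fixes" (hypothetical data).
proposals = [lambda x: x + 2, lambda x: x * x, lambda x: 2 * x]
cases = [(0, 0), (2, 4), (3, 6)]
accepted = revise_with_anchor(proposals, cases)
```

Note that the second proposal, `x * x`, passes the first two cases and only fails on the third. A plausible-looking fix can survive a weak check, which is why the anchor has to be independent of the generator and reasonably thorough.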
Why is internal revision so unreliable? Because models have a built-in bias toward trusting what they themselves produced: a high-probability generated answer simply *feels* correct when the same model goes to evaluate it, creating a self-agreement loop (Why do models trust their own generated answers?). When a model revisits its own uncertain answer, it usually grows *more* confident in the wrong answer rather than fixing it, a failure mode called "degeneration of thought" (Does a model improve by arguing with itself?). This is measurable: across o1-like models such as QwQ, R1, and LIMO, most revisions keep the wrong answer, and longer chains with more revision steps correlate with *lower* accuracy, not higher (Does self-revision actually improve reasoning in language models?). The decisive variable turns out to be the *source* of the critique: revision guided by an external critic improves accuracy, while internal self-assessment degrades it (Does revising your own reasoning actually help or hurt?).
This reframes what "reflection" actually does. Studies covering eight reasoning models find that visible reflection is mostly confirmatory theater: reflections rarely change the initial answer, and the reasoning traces don't faithfully represent what drove the output (Can we actually trust reasoning model outputs?; Is reflection in reasoning models actually fixing mistakes?). Training on longer reflection chains improves the *first* answer's quality but not the model's ability to catch and repair its own mistakes. And when you test genuine assumption-revision directly, on constraint satisfaction problems that demand real backtracking, frontier models such as DeepSeek-R1 and o1-preview stall at 20 to 23.6% exact match, revealing that reflective fluency doesn't translate into actually revising a model of the problem when the structure is unfamiliar (Can reasoning models actually sustain long-chain reflection?).
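To make "real backtracking" concrete, here is a minimal sketch of a classic backtracking search on a small graph-coloring problem. It is a textbook illustration, not the benchmark from the cited note; the function name `color_graph`, the example graph, and the retraction counter are assumptions introduced for the example. Each color assignment is a tentative assumption that gets retracted when it turns out to block a later node.

```python
def color_graph(neighbors: dict[str, list[str]], colors: list[str]):
    """Backtracking search; returns (assignment or None, retracted assumptions)."""
    order = list(neighbors)
    assignment: dict[str, str] = {}
    retractions = 0

    def consistent(node: str, color: str) -> bool:
        # A color is allowed if no already-colored neighbor uses it.
        return all(assignment.get(n) != color for n in neighbors[node])

    def solve(i: int) -> bool:
        nonlocal retractions
        if i == len(order):
            return True
        node = order[i]
        for color in colors:
            if consistent(node, color):
                assignment[node] = color  # tentative assumption
                if solve(i + 1):
                    return True
                del assignment[node]      # failed downstream: retract it
                retractions += 1
        return False

    solved = solve(0)
    return (assignment if solved else None), retractions


# Path graph A-C-D-B, visited in the awkward order A, B, C, D (hypothetical example).
graph = {"A": ["C"], "B": ["D"], "C": ["A", "D"], "D": ["B", "C"]}
solution, retractions = color_graph(graph, ["red", "blue"])
```

On this graph the locally fine choice of coloring B red has to be undone before a valid coloring exists. Confirming earlier choices, however fluently, never reaches the solution; the search only succeeds because it is willing to give up an assumption it already made.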
There's also a quieter, compounding danger: assumptions don't just fail to get revised, they self-reinforce through context. When a model's own earlier errors fill its context window, those errors bias later reasoning and degrade performance non-linearly. Scaling the model doesn't fix this; the only thing that helps is test-time compute, in the form of thinking models, that keeps the contaminated context from steering reasoning (Do models fail worse when their own errors fill the context?). So a system that keeps its own past outputs around is effectively feeding its stale assumptions back to itself.
The constructive thread across all of this points in one direction: assumption-revision works when it's *adversarial and external*, not introspective. Multi-agent debate with genuinely different models reverses the confidence-in-error spiral and improves both accuracy and calibration (Does a model improve by arguing with itself?), and comparing an answer against broader alternatives breaks the self-agreement loop (Why do models trust their own generated answers?). Worth knowing too: humans are part of this loop whether we intend it or not. Iterative prompting steers a model toward the user's *own* priors, so the "assumptions" being maintained are often co-produced by the person, not the system alone (How much does the user shape what a model generates?). And at the deepest level, some assumptions are baked into the architecture itself, namely the premise that language is a complete, stable thing extractable from text, which no amount of runtime self-revision can touch (What hidden assumptions drive how we build language models?). The takeaway: a system can't bootstrap its way to honest self-doubt; it has to be confronted from outside.
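A minimal sketch of comparing an answer against alternatives from different sources. This is a plain agreement check, a much simpler stand-in for multi-agent debate rather than debate itself. The function name `cross_check`, the `quorum` parameter, and the model labels are hypothetical; the answers are hard-coded placeholders, not real model outputs.

```python
from collections import Counter


def cross_check(answers: dict[str, str], quorum: float = 0.5) -> tuple[str, bool]:
    """Return (most common answer, agreed).

    `agreed` is True only when strictly more than `quorum` of the
    independent sources gave that answer.
    """
    if not answers:
        raise ValueError("need at least one answer to compare")
    counts = Counter(answers.values())
    top, votes = counts.most_common(1)[0]
    return top, votes / len(answers) > quorum


# Placeholder answers from three different (hypothetical) models.
answers = {"model_a": "42", "model_b": "42", "model_c": "41"}
answer, agreed = cross_check(answers)

# A split vote is flagged rather than resolved by either source's confidence.
_, split_agreed = cross_check({"model_a": "42", "model_b": "41"})
```

The design choice worth noting is what happens on disagreement: the check does not pick a winner by asking any single source how confident it is, since that would reintroduce the self-agreement loop. A flagged result is a signal to bring in a further outside check.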
Sources (12 notes)
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Research across eight models shows reflection is mostly confirmatory theater: reflections rarely change initial answers, and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.
Analysis of eight reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Foundation Priors research frames prompt engineering as divergence minimization between synthetic output and user priors. The refinement process systematically steers generation toward what users already expect, making outputs co-productions of model and user subjectivity.
LLMs assume language is a complete, stable thing extractable from text data. Enactive linguistics rejects both: language is a practice requiring embodied participation, and no dataset can capture its radical incompleteness and responsiveness.