Why does systematic overconfidence on self-generated outputs compound autoregressive errors?
This explores the feedback loop where a model's tendency to over-trust its own generated text means each error it commits gets fed back as confident context, biasing everything it generates next.
When a language model treats its own outputs as trustworthy, small mistakes turn into compounding ones: the autoregressive setup feeds each generated token back as input, so any built-in self-favoritism becomes a loop rather than a one-time error. The corpus assembles this picture from several angles that don't share vocabulary but describe the same machine.
The root bias is measurable. Post-trained models produce 3-4x lower output entropy on their own generations than on outside text, driven by an internal 'input surprise' signal that quietly modulates confidence without ever being verbalized (Why do models produce less uncertain outputs on their own text?). In plainer terms, a model recognizes its own writing and relaxes: it is more certain about text it produced. That recognition isn't passive. Post-training shifts a model from predicting the next token to enacting outputs it knows will become its own future inputs, closing an action-perception loop that pretraining never had (Do models recognize their own outputs as actions shaping future inputs?). So the post-trained model is primed to take its own past seriously.
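To make "lower output entropy" concrete, here is a minimal sketch of the Shannon entropy of a next-token distribution. The two distributions and the names `own_text_probs` and `outside_text_probs` are made-up illustrative values over a four-token vocabulary, not measurements from any model; the point is only that a more peaked distribution yields a smaller entropy.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions (assumed values, for illustration only).
own_text_probs = [0.85, 0.05, 0.05, 0.05]      # peaked: the model is "relaxed"
outside_text_probs = [0.40, 0.30, 0.20, 0.10]  # flatter: the model is less sure

# How many times more uncertain the model is on the flatter distribution.
entropy_ratio = shannon_entropy(outside_text_probs) / shannon_entropy(own_text_probs)
print(f"entropy ratio (outside / own): {entropy_ratio:.2f}")
```

In a real measurement the distributions would come from a model's output probabilities at each position, averaged over many tokens; that step is omitted here.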
That primes the compounding. When prior errors sit in the context history, performance degrades non-linearly: the model conditions on its own mistakes, and the failure rate climbs sharply over long-horizon tasks (Do models fail worse when their own errors fill the context?). Pair this with the finding that models carry a structural bias toward validating answers they generated themselves, because high-probability self-generated answers simply 'feel' more correct during the model's own evaluation (Why do models trust their own generated answers?), and you get the avalanche: the model can't flag its own error because it trusts the source, and the unflagged error then becomes confident context for the next step.
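A toy model can show why conditioning on past mistakes produces non-linear rather than additive degradation. The sketch below assumes, purely for illustration, that each error already in the context raises the next step's error probability by a fixed `gain`; the function name `expected_errors` and the parameters `base_rate` and `gain` are hypothetical and do not come from the cited notes.

```python
def expected_errors(steps, base_rate, gain):
    """Expected error count after `steps` steps in a toy self-conditioning model.

    Assumption: with k errors already in context, the next step fails with
    probability min(1, base_rate + gain * k).
    """
    # dist[k] = probability that exactly k errors are in the context so far.
    dist = [1.0]
    for _ in range(steps):
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            p_err = min(1.0, base_rate + gain * k)
            nxt[k] += prob * (1.0 - p_err)   # this step is correct
            nxt[k + 1] += prob * p_err       # this step adds one more error
        dist = nxt
    return sum(k * prob for k, prob in enumerate(dist))


# gain = 0.0 means errors are independent of context: growth is linear in steps.
independent = expected_errors(steps=50, base_rate=0.02, gain=0.0)
# gain > 0 means each error makes the next one more likely: growth accelerates.
self_conditioned = expected_errors(steps=50, base_rate=0.02, gain=0.05)
print(independent, self_conditioned)
```

With `gain` set to zero the expected error count is just `steps * base_rate`. With a positive `gain`, and as long as the probability cap is not reached, the expectation follows the recurrence E[n+1] = (1 + gain) * E[n] + base_rate, which grows geometrically. That is one simple way to read "non-linear" here, under this sketch's assumptions.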
Why doesn't the model correct itself out of this? Because pure self-improvement is circular. The generation-verification gap means a model that can't reliably verify can't reliably improve, and every method that actually works smuggles in an external anchor: a past model version, a third-party judge, user corrections, or tool feedback (Can models reliably improve themselves without external feedback?). Overconfidence on self-generated outputs removes exactly the signal needed to break the loop. Notably, the one intervention that helps with the self-conditioning failure is test-time compute, in the form of thinking models that prevent error-contaminated context from biasing reasoning (Do models fail worse when their own errors fill the context?). That is an external-anchor move in disguise: it holds reasoning apart from the contaminated trace.
The quietly unsettling part is that this same dynamic runs at the human layer. Users in every language studied track confidence signals rather than accuracy, so overconfident errors get followed systematically (Do users worldwide trust confident AI outputs even when wrong?), and the cognitive traps of human-AI interaction multiply when they co-occur rather than just adding up (Why do people trust AI outputs they shouldn't?). So overconfidence compounds twice through the same kind of loop: once inside the model's context window, and again between the model and the person who can't tell calibration from fluency.
Sources (7 notes)
Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Cross-linguistic research shows users in every language trust confident AI outputs even when inaccurate. While confidence expression varies by language, users everywhere track confidence signals rather than accuracy, making overconfident errors systematically followed.
Rose-Frame identifies map-territory confusion, intuition-reason conflation, and confirmation-bias reinforcement as traps that multiply their distorting effects when they co-occur. Evidence from cross-linguistic overreliance and architectural transformer biases confirms the compounding mechanism operates universally.