How does on-policy entropy recognition differ from training-time entropy collapse?
This explores two different things people mean by 'entropy' in trained models: the on-policy recognition signature (a model showing lower, more confident entropy when it recognizes its own outputs as its future inputs) versus entropy collapse (the training-time failure where a policy's exploratory diversity shrinks toward a performance ceiling).
In post-trained models, 'entropy' points to two different things: one is a *signature of recognition*, the other is a *failure of exploration*, and the corpus treats them as nearly opposite stories. On-policy entropy recognition is the finding that post-training flips a model from passively predicting text to enacting it: the model begins to treat its own outputs as the inputs it will later condition on, closing an action-perception loop. The measurable tell is that output entropy is 3-4x lower when the model is on its own policy (Do models recognize their own outputs as actions shaping future inputs?). Here, low entropy is *information*: the model is confident because it recognizes the trajectory as its own. It's a feature, a sign the loop has formed.
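As a rough illustration of how such an on-policy versus off-policy comparison could be measured, the sketch below averages per-token Shannon entropy over two sets of next-token distributions and reports their ratio. The distributions are made-up toy values over a four-token vocabulary, and the helper names `token_entropy` and `mean_entropy` are hypothetical; nothing here reproduces the measurement behind the 3-4x figure.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average per-token entropy across a sequence of distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# Toy, invented distributions (illustration only, not measured data).
# "On-policy": the model continues text it generated itself.
on_policy = [[0.94, 0.02, 0.02, 0.02], [0.91, 0.03, 0.03, 0.03]]
# "Off-policy": the model continues text from some other source.
off_policy = [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25]]

ratio = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"on-policy mean entropy:  {mean_entropy(on_policy):.3f} nats")
print(f"off-policy mean entropy: {mean_entropy(off_policy):.3f} nats")
print(f"off/on ratio:            {ratio:.2f}x")
```

In a real setting the distributions would come from the model's softmax outputs at each position, and the comparison would need matched prompts so that the only difference is whose text is being continued.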
Training-time entropy collapse is the inverse reading. There, falling entropy is a *pathology*: as RL optimizes for reward, the policy converges on a narrow band of high-reward strategies and stops exploring. The empirical law R = -a·exp(H) + b, where R is performance and H is policy entropy, says performance saturates as policy entropy approaches zero: you buy reward by spending exploration, and once exploration is gone the ceiling is fixed (Does policy entropy collapse limit reasoning performance in RL?). The same squeeze shows up beyond reasoning: RL on search agents compresses behavioral diversity through the identical mechanism, while SFT on diverse demonstrations preserves breadth (Does reinforcement learning squeeze exploration diversity in search agents?). It even collapses *format* diversity, with RL amplifying one dominant pretraining distribution and suppressing the alternatives within the first epoch (Does RL training collapse format diversity in pretrained models?).
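To make the saturation concrete, here is a minimal sketch that evaluates R = -a·exp(H) + b for a few entropy values. The coefficients `a` and `b` are hypothetical placeholders chosen for illustration (assuming a > 0), not fitted values from any experiment.

```python
import math

def predicted_performance(entropy, a, b):
    """Evaluate the fit R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients, for illustration only.
a, b = 0.2, 0.9

for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H = {h:.1f}  ->  R = {predicted_performance(h, a, b):.3f}")

# With H = 0, exp(H) = 1, so the curve tops out at b - a.
ceiling = predicted_performance(0.0, a, b)
```

Under this form, every further gain in R has to come from a drop in H, and since H cannot go below zero the attainable performance is bounded by b - a. That is the sense in which the ceiling is fixed once exploration is spent.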
So the key difference is what low entropy *means*. Recognition-low-entropy is the model knowing it's looking at its own work; collapse-low-entropy is the model having forgotten how to consider alternatives. One is confidence earned from a closed loop; the other is diversity lost to reward-chasing. They can even coexist — the enaction loop that gives you useful recognition is the same self-conditioning dynamic that, pushed by an optimizer, can starve exploration.
What makes this more than a definitional quibble is that entropy isn't uniform across tokens. Only about 20% of tokens, the high-entropy 'forking points', actually carry the reasoning decisions, and RLVR mostly adjusts those; training on them alone matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). This reframes both phenomena: collapse isn't a uniform dimming but a loss of entropy at the *pivotal* tokens, and recognition's confidence is most meaningful where the model still had a real choice to make. The two-phase view of RL sharpens it further: entropy on execution tokens stabilizes while planning-token entropy actually *rises* as strategy becomes the bottleneck (Does RL training follow a predictable two-phase learning sequence?), so 'entropy going down' is never the whole picture.
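One way to picture the forking-token idea is as a mask over positions: compute entropy per token, keep the top fraction, and restrict updates to those. The sketch below does only the selection step on made-up distributions. The function name `forking_mask`, the default 20% cutoff as a per-sequence rule, and the toy data are all assumptions for illustration, not the procedure used in the cited work.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(distributions, fraction=0.2):
    """Return one boolean per position, True for the highest-entropy tokens."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    # Rank positions by entropy, highest first, and keep the top k.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]

# Toy sequence of 10 positions: mostly confident, with two uncertain "forks".
confident = [0.97, 0.01, 0.01, 0.01]
sequence = [confident] * 10
sequence[3] = [0.25, 0.25, 0.25, 0.25]
sequence[7] = [0.40, 0.30, 0.20, 0.10]

mask = forking_mask(sequence)
print([i for i, keep in enumerate(mask) if keep])
```

In a training loop, a mask like this would typically multiply the per-token loss so that gradients flow only through the selected positions.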
For why this matters for reliability: low entropy has a darker cousin in misplaced confidence. Binary correctness rewards push models toward confident guessing and wreck calibration (Does binary reward training hurt model calibration?), and RLHF can leave a model that still internally represents the truth but is no longer committed to expressing it (Does RLHF make language models indifferent to truth?). The thread connecting all of it: confident, low-entropy output can mean the model has recognized something real, or it can mean optimization has quietly narrowed what the model is willing to say.
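The calibration point can be sketched with a small expected-reward calculation. The source notes mention adding a Brier score term to the binary reward; the exact form used below (correctness minus the squared gap between stated confidence and correctness) is an assumption, as are the function names and the example probability of 0.6.

```python
def binary_reward(correct):
    """1 for a correct answer, 0 otherwise; stated confidence is ignored."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct, confidence):
    """Correctness minus a Brier penalty (assumed form, for illustration)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_binary(p_correct):
    """Expected binary reward; it does not depend on stated confidence."""
    return p_correct * binary_reward(True) + (1 - p_correct) * binary_reward(False)

def expected_augmented(p_correct, confidence):
    """Expected Brier-augmented reward for a given stated confidence."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1 - p_correct) * brier_augmented_reward(False, confidence))

p = 0.6  # hypothetical true chance that the answer is right
grid = [i / 10 for i in range(11)]
best = max(grid, key=lambda q: expected_augmented(p, q))

print(f"binary reward, any confidence: {expected_binary(p):.2f}")
for q in (0.6, 1.0):
    print(f"augmented reward at confidence {q:.1f}: {expected_augmented(p, q):.2f}")
print(f"confidence that maximizes the augmented reward: {best:.1f}")
```

Under the binary reward, overclaiming costs nothing because confidence never enters the score. With the Brier term, the expected reward in this toy setup peaks when stated confidence matches the true chance of being right.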
Sources (8 notes)
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.