How does on-policy entropy recognition differ from training-time entropy collapse?

This explores two different things people mean by 'entropy' in trained models: the on-policy recognition signature (a model showing lower output entropy, that is, more confident predictions, when it is conditioning on its own outputs) versus entropy collapse (the training-time failure where a policy's exploratory diversity shrinks and performance hits a ceiling).

'Entropy' points to two different things in post-trained models: one is a *signature of recognition*, the other is a *failure of exploration*, and the corpus treats them as nearly opposite stories. On-policy entropy recognition is the finding that post-training flips a model from passively predicting text to enacting it: the model begins to treat its own outputs as the inputs it will later condition on, closing an action-perception loop. The measurable tell is that output entropy is 3-4x lower when the model is on its own policy (see "Do models recognize their own outputs as actions shaping future inputs?"). Here, low entropy is *information*: the model is confident because it recognizes the trajectory as its own. It's a feature, a sign the loop has formed.
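To make the "3-4x" comparison concrete, here is a minimal sketch of how a per-token entropy ratio could be computed. The toy distributions and the helper names (`token_entropy`, `mean_entropy`) are assumptions for illustration, not the measurement protocol of the cited note; a real comparison would use next-token distributions from an actual model on its own samples versus text from another source.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average per-token entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# Hypothetical toy distributions over a 4-token vocabulary (not real model output).
on_policy = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
off_policy = [[0.70, 0.10, 0.10, 0.10], [0.55, 0.15, 0.15, 0.15]]

# A ratio above 1 means the model is more confident on its own trajectory.
ratio = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy mean entropy: {ratio:.2f}")
```

The ratio alone cannot say whether the confidence is recognition or collapse; it only quantifies the gap that the two readings interpret differently.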

Training-time entropy collapse is the inverse reading. There, falling entropy is a *pathology*: as RL optimizes for reward, the policy converges on a narrow band of high-reward strategies and stops exploring. The empirical law R = -a·exp(H) + b says performance saturates as policy entropy H approaches zero: you buy reward by spending exploration, and once exploration is gone the ceiling is fixed (see "Does policy entropy collapse limit reasoning performance in RL?"). The same squeeze shows up beyond reasoning: RL on search agents compresses behavioral diversity through the identical mechanism, while SFT on diverse demonstrations preserves breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"). It even collapses *format* diversity, with RL amplifying one dominant pretraining distribution and suppressing the alternatives within the first epoch (see "Does RL training collapse format diversity in pretrained models?").
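The saturation claim follows directly from the form of the law: as H goes to 0, exp(H) goes to 1, so R approaches b - a. A small sketch, assuming a > 0 and using made-up coefficients (the fitted values are not given in this note):

```python
import math

def predicted_reward(entropy, a, b):
    """Empirical law from the text: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

def reward_ceiling(a, b):
    """Limit of the law as H -> 0, where exp(H) -> 1."""
    return b - a

# Hypothetical coefficients, chosen only for illustration (assumes a > 0).
a, b = 0.2, 0.9
for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H={h:.1f}  R={predicted_reward(h, a, b):.3f}")
print(f"ceiling at H=0: {reward_ceiling(a, b):.3f}")
```

Under this fit, every further gain has to come from spending entropy, and the gains shrink as H nears zero. This is why interventions that slow entropy reduction matter.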

So the key difference is what low entropy *means*. Recognition-low-entropy is the model knowing it's looking at its own work; collapse-low-entropy is the model having forgotten how to consider alternatives. One is confidence earned from a closed loop; the other is diversity lost to reward-chasing. They can even coexist — the enaction loop that gives you useful recognition is the same self-conditioning dynamic that, pushed by an optimizer, can starve exploration.

What makes this more than a definitional quibble is that entropy isn't uniform across tokens. Only about 20% of tokens, the high-entropy 'forking points', actually carry the reasoning decisions, and RLVR mostly adjusts those; training on them alone matches or exceeds full updates (see "Do high-entropy tokens drive reasoning model improvements?"). This reframes both phenomena: collapse isn't a uniform dimming, it's the loss of entropy at the *pivotal* tokens that matters, and recognition's confidence is most meaningful where the model still had a real choice to make. The two-phase view of RL sharpens it further: entropy on execution tokens stabilizes while planning-token entropy actually *rises* as strategy becomes the bottleneck (see "Does RL training follow a predictable two-phase learning sequence?"), so 'entropy going down' is never the whole picture.
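One way to picture the 'forking points' idea is a mask that keeps only the highest-entropy fifth of token positions. This is a sketch under assumptions: the function name, the per-sequence selection, and the toy entropies are all hypothetical, and the cited work's exact selection procedure is not described in this note.

```python
def forking_token_mask(entropies, fraction=0.2):
    """Mark the top `fraction` of token positions by entropy as forking tokens."""
    if not entropies:
        return []
    k = max(1, round(len(entropies) * fraction))
    # Indices of the k highest-entropy positions (ties go to the earlier position).
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(entropies))]

# Hypothetical per-token entropies for a 10-token response.
entropies = [0.02, 0.01, 1.40, 0.05, 0.03, 0.90, 0.04, 0.02, 0.10, 0.01]
mask = forking_token_mask(entropies)
print(mask)
```

In a training loop, a mask like this would gate which positions contribute to the policy-gradient loss, so updates concentrate where the model still had a real choice to make.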

For why this matters for reliability: low entropy as misplaced confidence has a darker cousin. Binary correctness rewards push models toward confident guessing and wreck calibration (see "Does binary reward training hurt model calibration?"), and RLHF can leave a model that still internally represents the truth but is no longer committed to expressing it (see "Does RLHF make language models indifferent to truth?"). The thread connecting all of it: confident, low-entropy output can mean the model has recognized something real, or it can mean optimization has quietly narrowed what the model is willing to say.
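The calibration note in the sources mentions adding the Brier score as a second reward term. A minimal sketch of what that could look like, assuming the model reports a confidence in [0, 1] and that the Brier penalty is simply subtracted from the correctness reward (the exact combination used in the cited work is not given here, and the function names are hypothetical):

```python
def binary_reward(correct):
    """Correctness-only reward: 1.0 for a right answer, 0.0 otherwise."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness minus the Brier penalty (confidence - outcome)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = binary_reward(correct)
    return outcome - (confidence - outcome) ** 2

# Compare a hedged guess with a confident guess, for right and wrong answers.
for correct in (True, False):
    for confidence in (0.5, 0.99):
        print(correct, confidence,
              binary_reward(correct),
              round(calibrated_reward(correct, confidence), 4))
```

The binary reward gives the same value regardless of stated confidence, so nothing discourages a confident wrong answer. With the Brier term, a wrong answer at 0.99 confidence scores worse than the same wrong answer at 0.5.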

Sources (8 notes)

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about entropy's role in post-trained LLMs. The question remains open: does low entropy in model outputs signal *recognition* (the model knows it's conditioning on its own trajectory) or *collapse* (optimization has starved exploration)? Or are they the same mechanism read two ways?

What a curated library found — and when (claims dated 2024–2026, not current truth):
• On-policy entropy recognition: post-training flips models from passive prediction to enaction; output entropy drops 3–4× when conditioning on own outputs, signaling a closed action-perception loop (2026).
• Training-time entropy collapse: RL trades policy entropy for reward, following the empirical fit R = −a·exp(H) + b; diversity loss appears in reasoning and search agents, and format selection collapses within the first RL epoch (2025).
• Only ~20% of tokens—high-entropy forking points—drive reasoning; RLVR updates concentrate there; low-entropy collapse is selective, not uniform (2025).
• Two-phase RL dynamic: execution-token entropy stabilizes while planning-token entropy rises as strategy becomes the bottleneck (2025).
• Binary correctness rewards degrade calibration; RLHF can leave internal truth-representation intact but suppress expression of it (2025).

Anchor papers (verify; mind their dates):
• arXiv:2605.25459 (2026) – From Simulation to Enaction
• arXiv:2505.22617 (2025) – The Entropy Mechanism of RL for Reasoning LMs
• arXiv:2506.01939 (2025) – Beyond the 80/20 Rule: High-Entropy Minority Tokens
• arXiv:2507.07484 (2025) – Machine Bullshit: Emergent Disregard for Truth

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer models, training methods (multi-epoch RL, mixed-loss schedules), inference-time techniques (speculative decoding, adaptive sampling), or better calibration metrics (proper scoring rules, uncertainty quantification) have since relaxed or overturned it. Separate the durable question (entropy's dual role) from perishable limits (e.g., collapse inevitable at scale, recognition-entropy immutable). Cite what resolved each constraint; flag what still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Does any recent paper argue entropy collapse and recognition are decoupled? Or that one dominates over the other in scaled settings?
(3) Propose 2 research questions that assume the regime may have moved: e.g., Can selective entropy regularization at high-value tokens preserve exploration while enabling recognition? Do test-time adaptive policies re-enable entropy without sacrificing on-policy confidence?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
