How does policy entropy collapse constrain token-level distribution in reasoning?
This explores what happens to a reasoning model's token-by-token choices when reinforcement learning drives its policy entropy toward zero — i.e., how the model's shrinking willingness to explore alternatives shows up at the level of which next tokens it picks.
The corpus frames entropy collapse as the central bottleneck in scaling reinforcement learning for reasoning. Performance follows an empirical law, R = -a·exp(H) + b, where H is the policy entropy: as H approaches zero the predicted performance saturates at -a + b, so the model has traded all of its exploratory capacity for a fixed, predictable ceiling (see 'Does policy entropy collapse limit reasoning performance in RL?'). The interventions that work (Clip-Cov, KL-Cov, GPPO) all operate by deliberately preserving entropy during training rather than letting it drain away.
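To make the saturation concrete, here is a minimal sketch that evaluates the empirical law for three next-token distributions. The coefficients `A` and `B` are hypothetical values chosen only for illustration; the corpus does not give fitted numbers, and the four-token vocabulary is a toy assumption.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_performance(entropy, a, b):
    """Empirical law quoted in the text: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients, NOT fitted values from any paper.
A, B = 0.2, 0.9

distributions = {
    "uniform": [0.25, 0.25, 0.25, 0.25],  # maximally exploratory over 4 tokens
    "peaked": [0.97, 0.01, 0.01, 0.01],   # nearly collapsed
    "one_hot": [1.0, 0.0, 0.0, 0.0],      # fully collapsed, H = 0
}

for name, dist in distributions.items():
    h = policy_entropy(dist)
    print(name, round(h, 3), round(predicted_performance(h, A, B), 3))

# At H = 0 the law reduces to R = -a + b: nothing is left to trade for more reward.
ceiling = predicted_performance(0.0, A, B)
```

Under this law, each further drop in entropy buys less performance than the last, which is why the ceiling is described as predictable once `a` and `b` are fitted.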
The reason this matters at the token level is that not all tokens carry the entropy. Only about 20% of tokens are genuinely high-entropy. These are the 'forking points' where the model is actually deciding between reasoning paths, and RLVR primarily adjusts exactly these tokens; training on that minority alone matches or exceeds full-gradient updates (see 'Do high-entropy tokens drive reasoning model improvements?'). So entropy collapse isn't a uniform sharpening of the distribution: it concentrates on those few pivotal decisions, and when they sharpen prematurely the model stops branching exactly where branching produced the reasoning gains. A complementary view shows that reasoning chains internally rank tokens by functional role, with symbolic-computation tokens preserved and grammar and meta-discourse tokens pruned first (see 'Which tokens in reasoning chains actually matter most?'). This suggests the distribution's 'shape' is structured, not flat, and that collapse erodes the wrong parts.
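The idea of restricting updates to forking tokens can be sketched as a simple mask over positions. This is an illustrative assumption about how such a selection might be done (rank positions by entropy and keep the top 20%), not the procedure from the cited work; `forking_token_mask` and `masked_mean_loss` are hypothetical helper names.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of the distribution at one position."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_token_mask(step_distributions, top_fraction=0.2):
    """Return a 0/1 mask marking the highest-entropy `top_fraction` of positions."""
    entropies = [token_entropy(d) for d in step_distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def masked_mean_loss(per_token_losses, mask):
    """Average the loss over masked-in positions only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)

# Toy sequence: 10 positions, two of which are real decision points.
confident = [0.99, 0.01]
steps = [confident] * 10
steps[3] = [0.5, 0.5]   # forking point
steps[7] = [0.6, 0.4]   # forking point

mask = forking_token_mask(steps)
losses = [1.0] * 10
losses[3], losses[7] = 2.0, 4.0
loss_on_forks = masked_mean_loss(losses, mask)
```

Here the mask selects positions 3 and 7 only, so the gradient signal in this sketch would come entirely from the two places where the toy model is undecided.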
The sharpest twist in the corpus is that the framing itself may be partly a measurement artifact. When exploration and exploitation are measured on hidden states rather than output tokens, they show near-zero correlation; the hard trade-off appears only at the token level. Building on Effective Rank analysis, VERL enhances exploration and exploitation simultaneously, with a reported 21.4% accuracy gain on Gaokao 2024 (see 'Is the exploration-exploitation trade-off actually fundamental?'). In other words, the token-level distribution is where collapse becomes visible and constraining, but the underlying representational capacity for diverse reasoning may not be collapsing in the same way.
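For readers unfamiliar with the metric, the sketch below computes effective rank in its commonly used form: the exponential of the Shannon entropy of the normalized singular values. This definition is an assumption here; the corpus does not spell out which variant the hidden-state analysis uses. The code assumes NumPy is available.

```python
import numpy as np

def effective_rank(hidden_states):
    """Effective rank of a (tokens x hidden_dim) matrix.

    Assumed definition: exp(H(p)), where p is the singular-value
    spectrum normalized to sum to 1 and H is Shannon entropy.
    """
    matrix = np.asarray(hidden_states, dtype=float)
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    total = singular_values.sum()
    if total == 0.0:
        return 0.0  # an all-zero matrix spans no directions
    p = singular_values / total
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))

# Four orthogonal directions used equally: effective rank 4.
diverse = np.eye(4)

# Every row is a multiple of the same vector: effective rank close to 1.
collapsed = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 0.25, 0.125])

rank_diverse = effective_rank(diverse)
rank_collapsed = effective_rank(collapsed)
```

A higher value means the hidden states spread over more directions, which is one way representational diversity could stay high even while the output token distribution sharpens.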
That reframing points toward an interesting escape route: if discrete token sampling is what forces the premature commitment, you can avoid collapsing the distribution at all. Soft Thinking keeps the probability distribution alive as a continuous 'concept token,' preserving a superposition of reasoning paths instead of picking one, and improves accuracy while using fewer tokens via entropy-based early stopping (see 'Can we explore multiple reasoning paths without committing to one token?'). Meta's Large Concept Model goes further, abandoning token-level generation for sentence-level reasoning in embedding space (see 'Can reasoning happen at the sentence level instead of tokens?'). Both are bets that the constraint entropy collapse imposes lives specifically in the discrete-token bottleneck.
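The two ingredients attributed to Soft Thinking, a probability-weighted concept embedding and an entropy-based stop, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method's actual implementation: the `threshold` and `patience` parameters and the exact stopping rule are hypothetical, and the embeddings are made-up two-dimensional vectors.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def concept_token(probs, embeddings):
    """Mix token embeddings by probability instead of sampling one token."""
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]

def should_stop(entropy_history, threshold=0.1, patience=3):
    """Hypothetical rule: stop after `patience` consecutive low-entropy steps."""
    recent = entropy_history[-patience:]
    return len(recent) == patience and all(h < threshold for h in recent)

# Toy vocabulary of three tokens with 2-d embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [0.5, 0.25, 0.25]

mixed = concept_token(probs, embeddings)           # stays between the candidates
still_thinking = should_stop([0.9, 0.05, 0.04])    # only two confident steps so far
done = should_stop([0.9, 0.05, 0.04, 0.03])        # three confident steps in a row
```

The mixed vector is fed forward in place of a single token's embedding, so no reasoning path is discarded at that step; the stop rule then ends generation once the model has been confident for several steps in a row.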
Worth knowing as a backdrop: chain-of-thought may be constrained imitation rather than genuine inference, with failures bounded by the training distribution (see 'Why does chain-of-thought reasoning fail in predictable ways?'), and reasoning breakdowns track instance novelty rather than task complexity (see 'Do language models fail at reasoning due to complexity or novelty?'). If reasoning is fundamentally pattern-matching over familiar instances, then preserving entropy preserves the model's access to a wider slice of those patterns, which reframes 'entropy collapse' as the model narrowing the set of remembered solutions it's still willing to reach for.
Sources (8 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.
Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.