Does policy entropy collapse limit how many iterations of reasoning training work?
This explores whether the 'entropy collapse' problem in reinforcement-learning reasoning training — where a model's exploration narrows to near zero — sets a hard ceiling on how far you can push that training, and what the corpus says you can do about it.
The short answer from the collection: yes, but it's a managed limit, not an absolute wall. The clearest statement of the ceiling comes from work showing that reasoning performance R follows a predictable law in the policy's entropy H, R = -a·exp(H) + b, in which gains saturate as H drifts toward zero (see "Does policy entropy collapse limit reasoning performance in RL?"). In plain terms: as training keeps reinforcing the few strategies that already win reward, the model stops trying anything new, and once it stops exploring, further iterations buy you almost nothing. The same note names the levers (Clip-Cov, KL-Cov, GPPO) that deliberately slow entropy's decline to keep exploratory capacity alive longer.
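To make the ceiling concrete, here is a minimal sketch of the law R = -a·exp(H) + b. It treats a and b as fitted constants; the values used (a = 0.2, b = 0.9) are invented for illustration, and the function names are hypothetical rather than taken from any cited work.

```python
import math


def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Evaluate R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a: float, b: float) -> float:
    """Performance predicted at H = 0, which simplifies to b - a."""
    return predicted_performance(0.0, a, b)


def remaining_headroom(entropy: float, a: float, b: float) -> float:
    """Gap between the ceiling and the current prediction: a * (exp(H) - 1)."""
    return performance_ceiling(a, b) - predicted_performance(entropy, a, b)


if __name__ == "__main__":
    # Illustrative coefficients only; real values would come from a fit.
    A, B = 0.2, 0.9
    for h in (1.0, 0.5, 0.1, 0.0):
        r = predicted_performance(h, A, B)
        gap = remaining_headroom(h, A, B)
        print(f"H={h:.1f}  R={r:.3f}  headroom={gap:.3f}")
```

Under these made-up coefficients the ceiling is b - a, about 0.7, and the headroom a·(exp(H) - 1) shrinks to zero as H does. That shrinking gap is the saturation the law describes: each further drop in entropy has less performance left to trade for.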
What makes this more than a single finding is that the same collapse mechanism shows up in a neighboring domain. RL-trained search agents lose behavioral diversity the same way reasoning models do, with policies converging on narrow reward-maximizing moves, and supervised fine-tuning on diverse demonstrations preserves the breadth that RL squeezes out (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So the limit isn't a quirk of reasoning tasks; it's a general property of how reward maximization compresses a policy. That reframes the iteration question: you're not fighting a task-specific bug, you're managing a trade-off baked into RL itself.
Where the entropy actually lives turns out to be surprisingly concentrated. Only about 20% of tokens, the high-entropy 'forking points' where the model decides which way to reason, carry the learning signal, and training on just those matches or exceeds full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). This connects to a two-phase picture of RL training: an early phase that locks in execution correctness, followed by a phase where strategic planning becomes the bottleneck and planning-token entropy actually rises while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). Read together, these suggest collapse isn't uniform: it's the planning/forking tokens whose diversity you most need to protect across iterations, and that's also where late-stage gains are still available.
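The idea of updating only on forking tokens can be sketched as below. This is a simplified illustration, not the cited method's implementation: it assumes you already have a next-token probability distribution and a loss for each position, and the helper names and the top-fraction selection rule are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(entropies, keep_fraction=0.2):
    """Mark roughly the top `keep_fraction` of positions by entropy as True."""
    if not entropies:
        return []
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(per_token_losses, mask):
    """Average the loss over masked-in positions only; 0.0 if none are kept."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    # Toy per-position entropies; in practice these come from model logits.
    entropies = [0.1, 1.2, 0.05, 0.9, 0.2, 0.0, 0.3, 0.01, 0.02, 0.4]
    losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    mask = forking_token_mask(entropies, keep_fraction=0.2)
    print(mask)
    print(masked_mean_loss(losses, mask))
```

In a real training loop the same mask would multiply the per-token policy-gradient loss, so low-entropy positions contribute no gradient. Whether the cutoff is computed per sequence or per batch is a design choice this sketch leaves open.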
There's a deeper reason iterations hit diminishing returns: several independent results argue that RL mostly *selects* reasoning that base models already contain rather than creating new capability. RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR all elicit latent ability already present in base activations (see "Do base models already contain hidden reasoning ability?"). If post-training is fundamentally elicitation, then once you've surfaced what's there, more iterations have less left to find, and entropy collapse is partly the symptom of running out of new behaviors to select. That also explains why some teams are trying to move reasoning earlier, planting chain-of-thought during pretraining itself with information-gain rewards rather than relying on RL to install it afterward (see "Can chain-of-thought reasoning be learned during pretraining itself?").
A less obvious option: the ceiling can be sidestepped by changing the training objective rather than just nursing entropy. Approaches that reward explanation quality and not just answer correctness internalize knowledge more durably than token-level supervision (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"), and energy-based transformers reach deliberative 'System 2' behavior from unsupervised learning with better scaling and out-of-distribution generalization, with no reward-maximizing collapse to manage at all (see "Can energy minimization unlock reasoning without domain-specific training?"). The honest caveat underneath all of it: even well-trained chain-of-thought degrades predictably outside its training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"), so preserving entropy keeps iterations *productive*; it doesn't promise that the reasoning you get is robust.
Sources (9 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.