How do execution and planning tokens differ in their entropy dynamics?
This explores the idea that not all tokens an LLM generates carry the same uncertainty — that the words where the model decides *what to do next* (planning/forking) behave differently from the words where it *carries out* a path already chosen (execution) — and what the corpus says about that split.
This reads the question as being about a split that recent reasoning research keeps rediscovering: a small set of high-stakes "decision" tokens versus a large mass of "follow-through" tokens, and how entropy (the model's uncertainty at each step) separates them. The corpus is unusually pointed on this. The clearest finding is that only about 20% of tokens in a reasoning trace are high-entropy, and those are the *forking points* where the model picks a direction; the other ~80% are low-entropy execution that mechanically completes whatever was just decided (see "Do high-entropy tokens drive reasoning model improvements?"). Strikingly, when reinforcement learning updates are restricted to that high-entropy minority, the result matches or beats updating on every token, which suggests the planning tokens carry most of the learning signal and the execution tokens are mostly along for the ride.
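The split described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of the cited note: it assumes the next-token probability distribution is available at every position, and the function names (`token_entropy`, `high_entropy_mask`, `masked_mean_loss`) and the toy distributions are hypothetical. The 0.2 fraction mirrors the "about 20%" figure from the text.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(entropies, keep_fraction=0.2):
    """Mark the top `keep_fraction` of positions by entropy as forking tokens."""
    if not entropies:
        return []
    k = max(1, round(keep_fraction * len(entropies)))
    # Indices of the k highest-entropy positions (ties broken by position).
    top = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))[:k]
    chosen = set(top)
    return [i in chosen for i in range(len(entropies))]

def masked_mean_loss(per_token_losses, mask):
    """Average the loss over masked-in positions only; the rest are ignored."""
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Toy trace (made-up numbers): five positions over a 4-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation (execution)
    [0.25, 0.25, 0.25, 0.25],      # uniform: a fork
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(d) for d in dists]
mask = high_entropy_mask(entropies, keep_fraction=0.2)
```

In a real training loop the mask would gate which positions contribute to the policy-gradient loss; here `masked_mean_loss` stands in for that step on plain Python lists.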
A second note sharpens *which* tokens those forks tend to be. Specific connector words like "Wait" and "Therefore" show sharp spikes in mutual information with the eventual correct answer. They are pivots where reasoning changes course, and suppressing them damages accuracy while suppressing the same number of random (execution) tokens does almost nothing (see "Do reflection tokens carry more information about correct answers?"). Put the two together and a picture emerges: planning tokens are sparse, high-entropy, information-dense hinge points; execution tokens are dense, low-entropy, and individually near-disposable. The entropy *is* the planning signal.
Where it gets interesting is what happens to that high-entropy mass over training. If you reward a model and let entropy fall unchecked, the forking tokens lose their uncertainty: the model stops branching, and performance hits a predictable ceiling described by the law R = -a·exp(H) + b, where R is reward, H is policy entropy, and a and b are fitted coefficients. Reward saturates as policy entropy approaches zero (see "Does policy entropy collapse limit reasoning performance in RL?"). In other words, entropy collapse is the death of planning: the model converts everything into confident execution and loses the capacity to explore alternative directions. Interventions like Clip-Cov and KL-Cov exist specifically to protect the entropy of those decision points from being optimized away.
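To make the ceiling concrete, the law can be evaluated directly. Setting H = 0 gives R = b - a, so under this form no further reward is available once entropy is exhausted. The sketch below assumes a and b are positive fitted constants; the values 0.1 and 0.6 are hypothetical and chosen only for illustration, and the function names are not from the source.

```python
import math

def predicted_reward(entropy, a, b):
    """Reward predicted by the fitted form R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

def reward_ceiling(a, b):
    """Value of the law at H = 0, i.e. the predicted plateau b - a."""
    return predicted_reward(0.0, a, b)

# Hypothetical coefficients, for illustration only.
a, b = 0.1, 0.6

# Reward rises as entropy is spent, but each step buys less,
# and nothing is left to spend once H reaches zero.
for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H={h:.1f}  R={predicted_reward(h, a, b):.3f}")
```

Because dR/dH = -a·exp(H), the reward gained per unit of entropy given up shrinks toward a as H approaches zero, which is one way to read the "predictable ceiling" in the text.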
The corpus also adds a useful caution about *how* this distinction is even measured. One note argues the familiar "exploration vs. exploitation" trade-off is largely an artifact of looking at things token-by-token: at the level of hidden states, exploration and exploitation barely correlate, and you can push both at once (see "Is the exploration-exploitation trade-off actually fundamental?"). That's a quiet warning for anyone treating "planning tokens" as a clean, fixed category: the entropy signature is real and exploitable, but reading too much narrative into per-token measurements can invent trade-offs that vanish when you look at the model's internal representations instead.
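The source notes name Effective Rank as the hidden-state metric behind this claim. As a rough illustration of what such a metric measures, the sketch below uses one common definition: the exponential of the Shannon entropy of the normalized singular values of a hidden-state matrix. Whether the cited note uses exactly this definition is an assumption, and the function takes precomputed singular values so that it stays dependency-free; obtaining them from real hidden states would need an SVD routine.

```python
import math

def effective_rank(singular_values):
    """Effective rank: exp of the entropy of the normalized singular values.

    Assumed definition, for illustration. A matrix whose energy is spread
    evenly over k directions scores k; one dominated by a single direction
    scores close to 1.
    """
    total = sum(singular_values)
    if total <= 0:
        return 0.0
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Hypothetical spectra: one spread out, one concentrated.
spread = effective_rank([1.0, 1.0, 1.0, 1.0])
concentrated = effective_rank([1.0, 0.0, 0.0, 0.0])
```

The point of a representation-level measure like this is that it summarizes a whole sequence of hidden states at once, rather than scoring each token's output distribution in isolation.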
The thing worth walking away with: a reasoning model spends most of its tokens executing and only a thin slice planning — and that thin, high-entropy slice is simultaneously where the learning happens, where connector words like "Wait" do their work, and where training quietly kills performance if it lets the uncertainty collapse. Almost everything that matters in reasoning RL is happening in 20% of the tokens.
Sources (4 notes)
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.