Why does RLVR increase token entropy while decreasing answer diversity?

This explores an apparent paradox in reinforcement learning from verifiable rewards: how the same training that keeps (or sharpens) uncertainty at certain decision tokens can simultaneously collapse the variety of final answers a model produces.

The paradox is that token-level uncertainty goes *up* while the diversity of final answers goes *down*. The corpus resolves it by showing that the two effects happen in different places. RLVR does not touch all tokens equally: only about 20% of tokens are high-entropy "forking points", the pivotal moments where the model decides which way a line of reasoning will branch, and RLVR concentrates most of its adjustment there. Training on just those minority tokens matches or beats updating every token (see "Do high-entropy tokens drive reasoning model improvements?"). So entropy isn't suppressed at the choice points; it is preserved or even amplified, because that's where the learning signal lives.
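To make "forking tokens" concrete, here is a minimal sketch of how one might score positions by entropy and restrict a loss to the top 20%. The function names, the toy distributions, and the simple top-fraction cutoff are all illustrative assumptions, not the procedure used in the cited note.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(entropies, top_fraction=0.2):
    """Mark the top `top_fraction` of positions by entropy as forking tokens.

    The 0.2 default mirrors the ~20% figure in the text; the cutoff rule
    itself is an assumption made for illustration.
    """
    if not entropies:
        return []
    k = max(1, round(len(entropies) * top_fraction))
    # Indices of the k highest-entropy positions.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(entropies))]

def masked_token_loss(per_token_losses, mask):
    """Average loss over forking tokens only; other positions are ignored."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# Toy sequence: 5 positions over a 4-token vocabulary (made-up numbers).
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # a fork: four continuations equally likely
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(d) for d in dists]
mask = forking_mask(entropies, top_fraction=0.2)
```

In this toy sequence only the uniform position is selected, so a loss computed through `masked_token_loss` would ignore the four near-deterministic tokens entirely.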

Meanwhile, the *answer* distribution collapses for a separate reason. Outcome-based RL rewards only the final correct answer, which sharpens the whole policy toward the trajectories that already work, and that sharpening transfers globally, draining diversity even on problems the model hasn't solved yet (see "Does outcome-based RL diversity loss spread across unsolved problems?"). The same compression shows up in search agents, where RL squeezes exploration into a few narrow reward-maximizing strategies through the same entropy-collapse mechanism, while SFT on diverse demonstrations keeps exploration broad (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So local uncertainty is preserved at the forks while global probability mass piles onto the winning answers.
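Answer diversity is measured over whole samples rather than over tokens, which is why it can move independently of token entropy. The sketch below shows two simple, assumed metrics (empirical answer entropy and the fraction of distinct answers) on made-up samples; the cited notes may use different measures.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Entropy (in nats) of the empirical distribution over final answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def distinct_fraction(answers):
    """Share of samples that are unique final answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return len(set(answers)) / len(answers)

# Hypothetical final answers from 8 samples on one unsolved problem.
base_samples = ["12", "15", "12", "9", "18", "15", "12", "21"]
rl_samples = ["12", "12", "12", "12", "12", "12", "15", "12"]

base_div = (answer_entropy(base_samples), distinct_fraction(base_samples))
rl_div = (answer_entropy(rl_samples), distinct_fraction(rl_samples))
```

Both numbers drop for the hypothetical post-RL samples, even though nothing here constrains how uncertain the model was at individual tokens along the way.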

There's a deeper mechanism underneath. RLVR mostly doesn't teach new reasoning; it activates strategies already latent in pretraining and makes the model sample them more efficiently within its existing capability boundary (see "What does reward learning actually do to model reasoning?"). Controlled experiments show RL amplifying a single dominant pretraining *format* within the first epoch while collapsing the alternatives, with the winner depending on model scale and not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). That's the diversity loss made concrete: many viable phrasings and approaches existed in the base model, and RL picks one lane.

What you didn't know you wanted to know: the diversity collapse isn't always bad, and it isn't even always in the same direction. Whether convergence helps depends on what the domain rewards. RLHF reduces lexical-syntactic diversity in code, where there's a right answer to converge on, but *increases* it in creative writing, where distinctiveness is the reward (see "Does preference tuning always reduce diversity the same way?"). And the collapse is avoidable by design: explicitly rewarding semantic diversity during RL catalyzes exploration and yields *higher* quality than quality-only training on both math and creative tasks (see "Can diversity optimization improve quality during language model training?"). The lesson is that diversity loss is a property of the reward shape, not an inevitable cost of RL. Preserving uncertainty at the forking tokens and preserving diversity in the answers turn out to be two different knobs.
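As a rough illustration of "reward shape" as the knob, the sketch below scales each sample's quality reward by a bonus that grows when fewer samples in the batch share its approach. This is not the DARLING formulation: DARLING uses a learned classifier for semantic similarity, whereas here a hypothetical exact-match label stands in for it, and the multiplicative form and `weight` parameter are assumptions.

```python
from collections import Counter

def diversity_shaped_rewards(labels, quality, weight=0.5):
    """Scale each quality reward by a rarity bonus within the batch.

    labels:  one approach/cluster label per sample (a stand-in for a
             learned semantic-equivalence classifier).
    quality: one scalar quality reward per sample.
    weight:  assumed strength of the diversity bonus.
    """
    if len(labels) != len(quality):
        raise ValueError("labels and quality must have the same length")
    n = len(labels)
    counts = Counter(labels)
    shaped = []
    for label, q in zip(labels, quality):
        rarity = 1.0 - counts[label] / n  # 0.0 when every sample agrees
        # Multiplying keeps zero-quality samples at zero reward, so being
        # rare is never rewarded on its own.
        shaped.append(q * (1.0 + weight * rarity))
    return shaped

# Hypothetical batch: three correct algebraic solutions, one correct
# geometric solution, and one incorrect guess.
labels = ["algebra", "algebra", "algebra", "geometry", "guess"]
quality = [1.0, 1.0, 1.0, 1.0, 0.0]
shaped = diversity_shaped_rewards(labels, quality)
```

Under these assumptions the lone geometric solution earns more than each algebraic one, while the incorrect guess still earns nothing despite being rare, so the pressure toward a single lane is reduced without rewarding wrong answers.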

Sources (7 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
