Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
This explores why RLVR's learning concentrates so unevenly — why a small subset of training examples, sorted by difficulty, carries most of the useful gradient while the rest contribute noise or harm.
The corpus's sharpest answer is an inverted-U: learning peaks at medium difficulty and falls off at both ends (see "Why do medium-difficulty problems teach reasoning better than hard ones?"). The reason is mechanical rather than mysterious. RLVR learns from *advantage*: the spread between better and worse trajectories in a group. Easy problems the model already solves every time produce no spread, so there's nothing to push against. Impossibly hard problems the model almost never solves also produce no usable spread, and worse, the rare accidental success gets treated as a high-value trajectory under group-relative normalization. Only the medium band gives you both frequent-enough success and informative failure in the same batch; that's where the signal lives.
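The "no spread, no signal" part of this argument can be sketched in a few lines. This is a minimal illustration, assuming binary (0/1) verifier rewards and a GRPO-style standardization within each group (subtract the group mean, divide by the group standard deviation). The function names, the `eps` constant, and the group size of 8 are hypothetical choices for the example, not taken from any particular implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps).

    `eps` is an illustrative guard against division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_informative_group(p, group_size):
    """Chance that a group of independent 0/1 rewards is neither all-0 nor all-1.

    `p` is the model's per-rollout success rate on the problem.
    """
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


if __name__ == "__main__":
    # Always solved or never solved: identical rewards, so every advantage is zero.
    print(group_advantages([1, 1, 1, 1]))
    print(group_advantages([0, 0, 0, 0]))
    # Mixed outcomes: successes and failures get opposite-signed advantages.
    print(group_advantages([1, 0, 1, 0]))
    # How often a group of 8 carries any signal at all, by success rate.
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(p, round(prob_informative_group(p, 8), 3))
```

Under these assumptions the share of groups with any usable spread is highest near a 50% success rate and shrinks toward both extremes, which is one simple way to read the inverted-U. The sketch covers only the missing-spread effect, not the separate problem of rare flukes at the hard end.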
The failure at the hard end is not just dead weight; it's actively corrosive. Training on nearly-impossible samples teaches degenerate shortcuts (answer repetition, computation-skipping) because the normalization machinery rewards rare accidental successes as if they were skill (see "Do overly hard RLVR samples actually harm model capabilities?"). And those shortcuts don't stay contained; they bleed back into capabilities the model already had. So 'certain difficulties drive the signal' has a darker companion claim: the wrong difficulties drive an *anti-signal* that contaminates the rest.
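A short worked calculation shows why a lone fluke carries so much weight. It assumes binary rewards and advantages computed by subtracting the group mean and dividing by the group (population) standard deviation; the symbols G (group size), \bar{r} (group mean reward), and \sigma (group standard deviation) are introduced here for illustration.

```latex
% Assumption: a group of G rollouts with binary rewards and exactly one success.
\begin{align*}
\bar{r} &= \frac{1}{G}, &
\sigma &= \sqrt{\frac{1}{G}\left(1 - \frac{1}{G}\right)} = \frac{\sqrt{G-1}}{G} \\[4pt]
A_{\text{success}} &= \frac{1 - \bar{r}}{\sigma} = \sqrt{G-1}, &
A_{\text{failure}} &= \frac{0 - \bar{r}}{\sigma} = -\frac{1}{\sqrt{G-1}}
\end{align*}
```

For G = 8 this gives roughly +2.65 for the single success against roughly -0.38 for each failure, and the gap widens as the group grows. Whatever produced that one success, sound reasoning or a shortcut, receives the full positive push.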
Now the deeper layer: why *tokens*, not just problems. A growing line of work argues RLVR isn't teaching new reasoning at all; it's *activating* behaviors already latent from pretraining (see "Why does RLVR work with completely random rewards?" and "What does reward learning actually do to model reasoning?"). The most startling evidence: random or even incorrect rewards still improve some models, because the optimization pressure surfaces a pretrained code-reasoning habit rather than installing anything new, and this only works for models whose pretraining laid that habit down (see "Why do random rewards improve reasoning for some models but not others?"). If RLVR acts as a phase transition that reweights an existing distribution rather than as a teacher, then the high-leverage tokens are precisely the ones that tip that transition: the format-defining, branch-selecting tokens where the model commits to one pretrained pattern over another. Relatedly, RL has been shown to converge hard onto a single dominant pretraining format within the first epoch, collapsing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). The learning signal is concentrated because the *choice points* are concentrated.
This reframes the difficulty story laterally. Medium difficulty matters not because medium problems are pedagogically ideal, but because they're the regime where the model's pretrained distribution is genuinely uncertain, where a few decisive tokens can swing the outcome, and therefore where advantage is largest and most teachable. At low difficulty the choice is already made; at high difficulty no token rescues a path the base model can't reach. Several notes converge on the ceiling this implies: RLVR sharpens sampling efficiency within the base model's existing boundary rather than expanding it (see "Does RLVR actually expand what models can reason about?"), and over-optimizing can collapse the boundary inward by punishing exploration (see "Why does RLVR training narrow a model's problem solving ability?").
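The ceiling claim rests on the pass@k comparison summarized in the sources. Below is a minimal sketch of that kind of comparison, using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts, the names `BASE_CORRECT` and `RL_CORRECT`, and the helper `mean_pass_at_k` are invented purely to illustrate what a crossover looks like; they are not results from any model.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: samples that were correct, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts out of n = 100 on four problems.
# "Base" solves every problem occasionally; "RL" solves two reliably, two never.
N_SAMPLES = 100
BASE_CORRECT = [5, 5, 5, 5]
RL_CORRECT = [90, 90, 0, 0]

if __name__ == "__main__":
    for k in (1, 8, 64):
        base = mean_pass_at_k(BASE_CORRECT, N_SAMPLES, k)
        rl = mean_pass_at_k(RL_CORRECT, N_SAMPLES, k)
        print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the RL-style profile wins at k = 1 and loses at k = 64: sharper sampling on problems already within reach, with no gain in the set of problems that can be solved at all.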
The thing you might not have expected to learn: the same concentration that makes RLVR efficient also makes it shallow. The high-signal tokens improve *local* coherence between adjacent reasoning steps without guaranteeing the proof is globally valid (see "Does RLVR actually improve mathematical reasoning or just coherence?"), and benchmark gains can be separated into genuine behavioral activation and mere memorization on contaminated data (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). So 'which tokens drive the signal' and 'does the signal mean what we think' turn out to be the same question wearing two hats.
Sources (10 notes)
RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
RLVR works nearly as well with spurious rewards as with correct ones because it catalyzes a phase transition in the model's output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Qwen2.5-Math improves by 16-25% on MATH-500 when trained with random or incorrect rewards, because the training activates latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.