Why do certain tokens at certain difficulties drive most of RLVR's learning signal?

This note explores why RLVR's learning concentrates so unevenly: why a small subset of training examples, sorted by difficulty, carries most of the useful gradient while the rest contribute noise or harm. The corpus's sharpest answer is an inverted-U: learning peaks at medium difficulty and falls off at both ends (see "Why do medium-difficulty problems teach reasoning better than hard ones?"). The reason is mechanical rather than mysterious. RLVR learns from *advantage*, the spread between better and worse trajectories in a group. Easy problems that the model already solves every time produce no spread, so there is nothing to push against. Impossibly hard problems that the model almost never solves also produce no usable spread in most groups, and worse, the rare accidental success gets treated as a high-value trajectory under group-relative normalization. Only the medium band gives you both frequent-enough success and informative failure in the same batch, and that is where the signal lives.
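The spread argument above can be made concrete with a minimal sketch. It assumes binary 0/1 rewards and a GRPO-style normalization in which each reward is centered on its group's mean and divided by the group's standard deviation; the notes do not give the exact formula, so the function name `group_relative_advantages`, the `eps` term, and the group size of 8 are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward on its group's mean and scale by the group's std.

    Assumed form of group-relative normalization; eps only guards
    against division by zero when every reward in the group is equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical groups of 8 rollouts for one problem each.
easy = [1.0] * 8                 # solved every time
medium = [1.0] * 4 + [0.0] * 4   # half the rollouts succeed
hard = [0.0] * 8                 # never solved in this group

for name, rewards in [("easy", easy), ("medium", medium), ("hard", hard)]:
    advantages = group_relative_advantages(rewards)
    print(name, [round(a, 2) for a in advantages])
```

Under these assumptions the easy and hard groups produce all-zero advantages, so they contribute no gradient, while the medium group produces advantages of roughly +1 for successes and -1 for failures.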

The failure at the hard end is not just dead weight; it is actively corrosive. Training on nearly impossible samples teaches degenerate shortcuts such as answer repetition and computation-skipping, because the normalization machinery rewards those rare flukes as if they were skill (see "Do overly hard RLVR samples actually harm model capabilities?"). Those shortcuts don't stay contained; they bleed back into capabilities the model already had. So 'certain difficulties drive the signal' has a darker companion claim: the wrong difficulties drive an *anti-signal* that contaminates the rest.
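A short worked calculation shows how large the fluke's advantage can get. It assumes binary rewards, a group of G rollouts with exactly one success, and mean/standard-deviation normalization with no epsilon term; the symbols G, mu, sigma, and A are introduced here for illustration and do not come from the notes.

```latex
\[
\mu = \frac{1}{G}, \qquad
\sigma = \sqrt{\mu(1-\mu)} = \frac{\sqrt{G-1}}{G}
\]
\[
A_{\text{success}} = \frac{1-\mu}{\sigma} = \sqrt{G-1}, \qquad
A_{\text{failure}} = \frac{0-\mu}{\sigma} = -\frac{1}{\sqrt{G-1}}
\]
```

Under these assumptions, with G = 16 the lone success receives an advantage of about 3.87 while each failure receives about -0.26, so whatever the fluke trajectory happened to do is reinforced far more strongly than any single failure is discouraged.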

Now the deeper layer: why *tokens*, not just problems. A growing line of work argues RLVR isn't teaching new reasoning at all; it's *activating* behaviors already latent from pretraining (see "Why does RLVR work with completely random rewards?" and "What does reward learning actually do to model reasoning?"). The most startling evidence: random or even incorrect rewards still improve some models, because the optimization pressure surfaces a pretrained code-reasoning habit rather than installing anything new, and this only works for models whose pretraining laid that habit down (see "Why do random rewards improve reasoning for some models but not others?"). If RLVR acts as a phase transition that reweights an existing distribution, rather than as a teacher, then the high-leverage tokens are precisely the ones that tip that transition: the format-defining, branch-selecting tokens where the model commits to one pretrained pattern over another. Relatedly, RL has been shown to converge hard onto a single dominant pretraining format within the first epoch, collapsing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). The learning signal is concentrated because the *choice points* are concentrated.

This reframes the difficulty story laterally. Medium difficulty matters not because medium problems are pedagogically ideal, but because they're the regime where the model's pretrained distribution is genuinely uncertain: a few decisive tokens can swing the outcome, and therefore advantage is largest and most teachable. At low difficulty the choice is already made; at high difficulty no token rescues a path the base model can't reach. Several notes converge on the ceiling this implies: RLVR sharpens sampling efficiency within the base model's existing boundary rather than expanding it (see "Does RLVR actually expand what models can reason about?"), and over-optimizing can collapse the boundary inward by punishing exploration (see "Why does RLVR training narrow a model's problem solving ability?").
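The link between uncertainty and usable advantage can be sketched with a toy model. It assumes each rollout on a problem succeeds independently with a fixed probability p, which is a simplification not stated in the notes; the function names and the group size of 8 are hypothetical. A group yields nonzero advantage only when it contains both a success and a failure, and the per-rollout reward variance is p(1 - p).

```python
def mixed_group_probability(p, group_size):
    """Chance that a group of independent binary rollouts contains
    at least one success and at least one failure."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


def reward_variance(p):
    """Variance of a single binary (Bernoulli) reward with success rate p."""
    return p * (1.0 - p)


# Sweep from near-impossible to near-trivial problems (illustrative values).
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(
        f"p={p:.2f}  "
        f"variance={reward_variance(p):.4f}  "
        f"mixed_group={mixed_group_probability(p, 8):.3f}"
    )
```

Both quantities peak at p = 0.5 and vanish as p approaches 0 or 1, which matches the inverted-U shape under this toy model. The sketch says nothing about which tokens carry the signal; it only illustrates why groups drawn from the uncertain middle are the ones with any spread to learn from.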

The thing you might not have expected to learn: the same concentration that makes RLVR efficient also makes it shallow. The high-signal tokens improve *local* coherence between adjacent reasoning steps without guaranteeing the proof is globally valid (see "Does RLVR actually improve mathematical reasoning or just coherence?"), and benchmark gains can be separated into genuine behavioral activation versus mere memorization on contaminated data (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). So 'which tokens drive the signal' and 'does the signal mean what we think' turn out to be the same question wearing two hats.

Sources (10 notes)

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does RLVR work with completely random rewards?

RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves by 16-25% on MATH-500 when trained with random or incorrect rewards, because those rewards activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about token-level learning concentration in RLVR (reinforcement learning with verifiable rewards). The question remains open: which tokens actually drive RLVR's learning signal, and does that signal improve reasoning or merely activate latent pretraining?

What a curated library found — and when (dated claims, not current truth, spanning April 2025–May 2026):
• Medium-difficulty samples produce peak learning via inverted-U; easy samples yield no advantage spread, hard samples induce degenerate shortcuts (Apr 2025).
• RLVR may activate pretrained reasoning patterns rather than teach new ones; even random/incorrect rewards improve some models by surfacing latent behaviors (Apr–Jul 2025).
• RL converges onto a single dominant pretraining format within one epoch, collapsing alternatives; choice-point tokens (format-defining, branch-selecting) concentrate the signal (Apr 2025).
• RLVR sharpens efficiency within the base model's capability boundary but does not expand it; over-optimization narrows scope (Jul 2025).
• Local trace coherence improves without guaranteeing global validity; benchmark gains separate into behavioral activation vs. memorization on contaminated data (Oct 2025).

Anchor papers (verify; mind their dates):
• 2504.07912 — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (Apr 2025)
• 2507.14843 — The Invisible Leash: Why RLVR May Not Escape Its Origin (Jul 2025)
• 2510.18176 — Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (Oct 2025)
• 2605.28388 — Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (May 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—inverted-U difficulty, latent activation, format convergence, boundary stasis, local-vs-global coherence—judge whether newer scaling, architectural advances (e.g., sparse gating, modular RL), or post-training methods (e.g., process reward models, multi-objective RL) have since relaxed or overturned it. Distinguish durable questions (e.g., "Does RLVR truly expand reasoning or reweight pretraining?") from perishable constraints (e.g., "Medium difficulty is the only useful regime"). Cite what resolved each.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months that claims RLVR does expand capability beyond the base model, escapes pretraining lock-in, or achieves global validity without local-coherence tricks.
(3) Propose 2 research questions that assume the regime may have shifted: one assuming boundary expansion *is* now possible, one assuming token concentration has been decoupled from difficulty via better curricula or reward design.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
