Why do medium-difficulty problems teach reasoning better than hard ones?
Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
It is tempting to assume harder training problems teach more, that pushing the model against the limit of its ability is where reasoning improves. RLVR does not behave that way. Difficulty-wise and one-sample analyses reveal an inverted-U: medium-difficulty problems yield the strongest and most stable reasoning improvements, easy problems contribute little, and overly hard problems provide weak learning signals and can actively degrade performance.
The mechanism runs through group-relative advantage. Easy problems are mostly solved, so within-group reward variance is low and the relative-advantage signal is small. Overly hard problems are mostly failed, so they too produce weak relative-advantage signals — and worse, the rare accidentally-rewarded trajectory (a shortcut, an incomplete computation that lands on the right answer) gets amplified by group-relative normalization into a biased update. Medium-difficulty problems sit where the model succeeds often enough to learn from contrast but fails often enough that success is informative — the regime where advantage estimation has the most signal.
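The mechanism above can be made concrete with a small sketch. It assumes binary verifiable rewards and a GRPO-style normalization, (reward minus group mean) divided by group standard deviation; the note does not specify the exact estimator, so the function names, the `eps` term, and the example groups are illustrative assumptions rather than the paper's implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def bernoulli_reward_std(pass_rate):
    """Std of a 0/1 reward with the given pass rate; largest at 0.5."""
    return (pass_rate * (1.0 - pass_rate)) ** 0.5


# Hypothetical groups of 8 rollouts with binary verifiable rewards.
groups = {
    "easy (8/8 solved)": [1, 1, 1, 1, 1, 1, 1, 1],
    "medium (4/8 solved)": [1, 0, 1, 0, 1, 0, 1, 0],
    "hard (0/8 solved)": [0, 0, 0, 0, 0, 0, 0, 0],
    "hard (1/8 lucky)": [1, 0, 0, 0, 0, 0, 0, 0],
}

for name, rewards in groups.items():
    advantages = group_relative_advantages(rewards)
    print(f"{name}: max advantage = {max(advantages):.2f}")
```

Under these assumptions the all-solved and all-failed groups produce zero advantage everywhere, the balanced group gives every success an advantage of about 1, and the single lucky success in the hard group receives about 2.65 (roughly the square root of 7). That last case is the amplification the note describes: the rarer the success, the larger the normalized weight on whatever trajectory happened to produce it.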
Why it matters: this is a curriculum claim with teeth. It says the standard instinct to harvest hard examples for RLVR is counterproductive without intervention, and it explains why in terms of the advantage estimator rather than a vague "too hard to learn." The practical move is difficulty-adaptive: either filter toward the medium band or repair hard samples (the paper proposes backward-reasoning reformulation and feature-guided signals to raise reward density). The counterpoint is that "medium difficulty" is defined relative to the model's current capability, so the productive band moves as training proceeds. That is the seam where this static finding meets the dynamic-informativeness problem.
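One way to picture the filtering option is a pass-rate band that is re-estimated from the current policy's rollouts. The sketch below is an assumption-laden illustration, not the paper's procedure: the `low` and `high` thresholds, the function names, and the example rollout data are all hypothetical.

```python
def empirical_pass_rate(rewards):
    """Fraction of rollouts that earned the verifiable reward."""
    return sum(rewards) / len(rewards)


def select_medium_band(rollout_rewards, low=0.2, high=0.8):
    """Return problem ids whose current pass rate lies in [low, high].

    rollout_rewards maps a problem id to the 0/1 rewards of its rollouts
    under the current policy. Problems with no rollouts are skipped.
    """
    return [
        problem_id
        for problem_id, rewards in rollout_rewards.items()
        if rewards and low <= empirical_pass_rate(rewards) <= high
    ]


# Hypothetical rollouts for the same three problems at two training stages.
early = {
    "p1": [1, 1, 1, 1, 1, 1, 1, 1],  # already easy
    "p2": [1, 0, 1, 0, 0, 1, 0, 1],  # medium for the early model
    "p3": [0, 0, 0, 0, 0, 0, 0, 0],  # too hard for the early model
}
later = {
    "p1": [1, 1, 1, 1, 1, 1, 1, 1],
    "p2": [1, 1, 1, 1, 1, 1, 1, 0],  # has become easy
    "p3": [0, 1, 0, 0, 1, 0, 1, 0],  # has become medium
}

print(select_medium_band(early))
print(select_medium_band(later))
```

The selected set shifts from `p2` to `p3` between the two snapshots, which is the point of the counterpoint: a band chosen once from a fixed difficulty label would keep training on `p2` after it has stopped being informative.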
Inquiring lines that use this note as a source (12)
This note is a source for these synthesized inquiries.
- Why do weaker models generate better training data than stronger models?
- Why do weaker teacher models sometimes produce better training signals than stronger ones?
- How does a challenger's escalating difficulty function as curriculum?
- Why do medium-difficulty problems produce more stable learning gains?
- How do difficulty metrics relate to the true value of training examples?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- Why do adaptive curriculum schemes outperform static difficulty filters?
- Does the productive difficulty band ever stabilize during training?
- How does difficulty-adaptive curriculum learning change which samples get selected during training?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- Why do students learn better from explanations than from solving problems from scratch?
Related concepts in this collection (6)
- Can adaptive guidance from solution traces reduce reward sparsity in RL?
  When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.
  proposes the repair this note motivates: feeding partial traces converts unproductive hard samples into learnable ones
- Can we prune training data without hurting model performance?
  This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste: if we can find which examples are truly necessary, we could train better models on far less data.
  generalizes the selectivity claim: difficulty-based selection of training data changes scaling behavior, here applied to RLVR rollouts
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  difficulty as an allocation signal at inference mirrors difficulty as a curriculum signal at training
- Do high-entropy tokens drive reasoning model improvements?
  Explores whether only a small fraction of tokens, those with high entropy at decision points, actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
  both locate RLVR's signal in a sparse productive subset; difficulty band at the sample level, forking tokens at the token level
- Do overly hard RLVR samples actually harm model capabilities?
  Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
  grounds: the downhill half of the inverted-U; names the mechanism (shortcut amplification, degeneracy) by which over-hard samples actively degrade rather than merely fail to help
- What reasoning features does each difficulty level reinforce?
  When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.
  extends: explains what each difficulty band teaches at the feature level, deepening the inverted-U from "how much signal" to "which features get reinforced"
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Original note title
sample difficulty has a non-monotonic effect on rlvr where medium-difficulty problems yield the strongest most stable gains