INQUIRING LINE

Why do medium-difficulty problems produce more stable learning gains?

This explores why problems of moderate difficulty — not too easy, not too hard — give the most reliable improvement when training models with reinforcement learning, and what's actually happening inside the model that makes that band productive.


The corpus has a clear answer with a memorable shape: learning across difficulty follows an inverted-U curve. Medium problems win because they balance enough successes to give the model a foothold with enough failures to be informative; the learning signal is strongest where outcomes are genuinely uncertain (see "Why do medium-difficulty problems teach reasoning better than hard ones?"). Easy samples lack variance (the model already wins, so there is little left to learn), and hard samples are where things actively break.
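
One way to see why uncertain outcomes carry the most signal is to look at the variance of a pass/fail reward. The sketch below is an illustration only: it assumes a binary 0/1 reward and uses reward variance as a simple proxy for learning signal, which is an assumption of this example rather than a measurement from the corpus. The function name `reward_variance` is hypothetical.

```python
def reward_variance(pass_rate: float) -> float:
    """Variance of a 0/1 reward when the model succeeds with probability pass_rate.

    Assumption: reward is binary (Bernoulli), so variance is p * (1 - p).
    """
    return pass_rate * (1.0 - pass_rate)


# Sweep from "always fails" (0.0) to "always succeeds" (1.0).
pass_rates = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
variances = {p: reward_variance(p) for p in pass_rates}

# The pass rate with the largest variance is the most uncertain one.
best = max(variances, key=variances.get)

for p, v in variances.items():
    print(f"pass rate {p:.1f} -> reward variance {v:.2f}")
print(f"largest variance at pass rate {best}")
```

The variance is zero at both extremes and peaks at a 50% pass rate, which mirrors the inverted-U shape: a problem the model always solves or never solves gives rollouts with nothing to tell apart.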

What makes this more than a tuning heuristic is what the corpus says happens *inside* the model at each difficulty. Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems are solved only occasionally, so deliberate reasoning is rarely rewarded; medium difficulty is the one band that strengthens both shortcut-resistance and genuine reasoning at once (see "What reasoning features does each difficulty level reinforce?"). That is the deeper reason the gains are *stable* rather than just large: identical accuracy improvements can mask opposite internal changes, and only the medium band reinforces the durable kind.

The failure mode at the hard end is worth understanding because it explains the instability you avoid. Training on near-impossible problems doesn't just waste signal; it degrades the model. Because group-relative reward normalization treats a rare accidental success as a high-advantage trajectory, the model learns to repeat answers and skip computation, and these degenerate shortcuts contaminate capabilities it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the 'stable gains' from medium problems are partly the absence of this active corruption.
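
The amplification of a rare success can be made concrete with a small calculation. This is a minimal sketch that assumes the common form of group-relative normalization, (reward minus group mean) divided by group standard deviation; the corpus names the mechanism but not the exact formula, so the formula, the group size of 16, and the name `group_relative_advantages` are all assumptions of this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own rollout group.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Medium problem: 8 successes in 16 rollouts.
medium_group = [1.0] * 8 + [0.0] * 8
# All-failure group: no variance, so every advantage is zero.
all_fail_group = [0.0] * 16

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)
all_fail_adv = group_relative_advantages(all_fail_group)

print(f"advantage of the lone success on a hard problem: {hard_adv[0]:.2f}")
print(f"advantage of a success on a medium problem:      {medium_adv[0]:.2f}")
```

Under these assumptions the lone success in the hard group receives an advantage close to the square root of 15 (about 3.87), nearly four times the advantage of a success in the medium group, regardless of whether that success came from sound reasoning or a lucky shortcut.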

Here's the twist that makes 'medium' harder than it sounds: difficulty isn't a fixed property of a problem. A sample's teaching value depends on the interaction between its difficulty and the model's *current* ability, so the productive band drifts as the model improves; a problem that was medium at step 100 may be trivial by step 300 (see "How does model ability change what samples teach?"). Stable learning therefore comes not from picking medium problems once, but from continuously re-centering on the moving target.
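
Re-centering can be sketched as re-estimating each problem's pass rate with the current model and keeping only those inside a target band. Everything here is hypothetical: the band limits of 0.2 and 0.8, the rollout count, and the toy solver (a deterministic stand-in for a policy, with problems represented by a difficulty number) are assumptions made so the example runs on its own, not details from the corpus.

```python
def estimate_pass_rate(solve, problem, n_rollouts=8):
    """Fraction of n_rollouts in which solve(problem) returns True."""
    return sum(bool(solve(problem)) for _ in range(n_rollouts)) / n_rollouts


def select_medium_band(problems, solve, low=0.2, high=0.8, n_rollouts=8):
    """Keep problems whose pass rate under the current solver is in [low, high]."""
    return [
        p for p in problems
        if low <= estimate_pass_rate(solve, p, n_rollouts) <= high
    ]


def make_toy_solver(ability):
    """Hypothetical stand-in for a policy at a given ability level.

    Succeeds on a fixed fraction of attempts; the fraction falls as the
    problem's difficulty rises above the solver's ability.
    """
    attempts = {}

    def solve(difficulty):
        p = min(1.0, max(0.0, 0.5 + 2.0 * (ability - difficulty)))
        n = attempts.get(difficulty, 0)
        attempts[difficulty] = n + 1
        # Succeed whenever the running quota floor(p * k) increases.
        return int(p * (n + 1)) > int(p * n)

    return solve


difficulties = [0.0, 0.25, 0.5, 0.75, 1.0]

# Same problem pool, evaluated against a weaker and a stronger solver.
early_band = select_medium_band(difficulties, make_toy_solver(ability=0.25))
late_band = select_medium_band(difficulties, make_toy_solver(ability=0.75))

print("band for the weaker solver:  ", early_band)
print("band for the stronger solver:", late_band)
```

The same pool yields a different medium band once ability changes, so a selection made early would be stale later. In practice the re-estimation would run periodically during training, at a cadence the corpus does not specify.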

The corpus also offers escape hatches for when you can't find that band. If even your hardest useful problems produce all-failure rollouts, step-wise expert-similarity rewards give a dense signal by scoring each move against an expert, so the model learns something even when no full attempt succeeds (see "Can step-wise expert rewards help small models learn hard reasoning?"). And in a striking counterpoint to the difficulty-curve framing, a single well-chosen example can activate latent reasoning and keep improving test accuracy long after training accuracy saturates (see "Can a single training example unlock mathematical reasoning?"), a reminder that the medium-difficulty story is about *signal quality*, not problem quantity.
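
To illustrate how a step-wise reward stays informative when the final answer is wrong, the sketch below scores each model step against the expert step at the same position. The similarity metric (Python's `difflib.SequenceMatcher.ratio`), the position-by-position alignment, and the name `stepwise_similarity_rewards` are assumptions of this example; the corpus only says the reward measures alignment with expert actions at each step.

```python
from difflib import SequenceMatcher


def stepwise_similarity_rewards(model_steps, expert_steps):
    """Return one similarity score in [0, 1] per expert step.

    Assumption: steps are aligned by position; a missing model step
    is compared as an empty string and scores 0.0.
    """
    rewards = []
    for i, expert_step in enumerate(expert_steps):
        model_step = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, model_step, expert_step).ratio())
    return rewards


expert = ["x + 3 = 7", "x = 7 - 3", "x = 4"]
# The rollout starts correctly, then makes a sign error and ends wrong.
rollout = ["x + 3 = 7", "x = 7 + 3", "x = 10"]
# A rollout that stops after the first step.
truncated = ["x + 3 = 7"]

rewards = stepwise_similarity_rewards(rollout, expert)
truncated_rewards = stepwise_similarity_rewards(truncated, expert)

for step, r in zip(rollout, rewards):
    print(f"{step!r}: step reward {r:.2f}")
```

An outcome-only reward would give this rollout zero, but the step-wise scores still credit the correct first step and give partial credit to the near-miss second step, which is the dense signal the paragraph above describes.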


Sources (6 notes)

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

What reasoning features does each difficulty level reinforce?

Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about sample difficulty and learning stability in RLVR. The question remains open: why do medium-difficulty problems produce more stable learning gains than easy or hard ones?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026.
• Medium-difficulty problems follow an inverted-U curve: they balance enough successes to ground learning with enough failures to provide signal, whereas easy problems suppress deliberation and hard problems rarely succeed (~2025).
• Easy samples reinforce shortcuts and suppress reasoning; hard samples reward rare accidental successes, causing models to learn degenerate skip-computation behaviors that contaminate existing capabilities (~2026).
• The productive difficulty band is NOT static — it drifts as model ability improves, so a medium problem at step 100 may be trivial by step 300, requiring continuous re-centering (~2026).
• Supervised RL (step-wise expert similarity) provides dense learning signals even on all-failure rollouts (~2025).
• A single well-chosen example can activate latent reasoning and sustain accuracy gains long after training saturates (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2605.28388 — Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (2026-05)
• arXiv:2510.25992 — Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning (2025-10)
• arXiv:2504.20571 — Reinforcement Learning for Reasoning in LLMs with One Training Example (2025-04)
• arXiv:2605.12484 — Learning, Fast and Slow: Towards LLMs That Adapt Continually (2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the inverted-U curve, degenerate hard-sample corruption, and dynamic difficulty drift: has stronger curriculum learning, online difficulty adaptation, or newer reward-shaping (e.g., implicit preference models, outcome-supervision hybrids) since relaxed or overturned these limits? Where does the constraint still hold? Cite evidence.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially any showing linear difficulty scaling, one-shot activation without medium-difficulty tuning, or stable gains on uniformly hard problems.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can adaptive curricula that track model uncertainty eliminate the need to manually re-center difficulty? (b) Does the instability of hard-sample training persist when combined with uncertainty-weighted or divergence-aware reward normalization?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
