INQUIRING LINE

How does the optimal difficulty band shift as the model's capabilities improve during training?

This explores how the 'sweet spot' difficulty of training problems doesn't stay fixed — as a model gets better, the problems that teach it the most keep moving, and the corpus has a surprising amount to say about why.


The short answer is that the optimal difficulty band is a moving target, and it drifts faster than most training setups account for. The core insight is that a problem's teaching value isn't a property of the problem alone. It's a property of the *relationship* between the problem's difficulty and what the model can currently do. A sample that was richly informative at step 100 can become useless or even harmful by step 200, because the model has outgrown it [How does model ability change what samples teach?].

Where does the band sit at any given moment? The corpus points consistently to the middle. Learning gains follow an inverted-U across difficulty: medium-hard problems teach best because they mix enough successes to give a usable signal with enough failures to be informative, while easy problems have no variance to learn from and brutally hard ones produce almost no successes [Why do medium-difficulty problems teach reasoning better than hard ones?]. As the model improves, that medium zone slides upward: yesterday's hard problem becomes today's productive medium, and yesterday's medium becomes too easy to bother with. The implication is that static difficulty labels go stale within training steps, so any curriculum that fixes difficulty up front is calibrating to a model that no longer exists.
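The sliding band can be illustrated with a minimal sketch, assuming binary (0/1) rewards and a per-problem pass rate measured against the current checkpoint. The thresholds `low` and `high`, the problem names, and the pass rates are hypothetical values chosen for illustration; none of them come from the corpus.

```python
def reward_variance(pass_rate: float) -> float:
    """Variance of a binary (0/1) reward with the given success probability."""
    return pass_rate * (1.0 - pass_rate)


def select_band(pass_rates, low=0.2, high=0.8):
    """Return problem ids whose *current* pass rate falls inside [low, high].

    The thresholds are hypothetical; a real setup would tune them.
    """
    return [pid for pid, p in pass_rates.items() if low <= p <= high]


# Hypothetical pass rates for the same three problems at two checkpoints.
early = {"easy": 0.95, "medium": 0.50, "hard": 0.05}
later = {"easy": 1.00, "medium": 0.90, "hard": 0.45}

print(select_band(early))  # ['medium']
print(select_band(later))  # ['hard']
```

The labels "medium" and "hard" never change, but the selected set does, because the selection depends on pass rates that must be re-measured as the model improves. `reward_variance` peaks at a pass rate of 0.5 and falls to zero at both extremes, which is one simple way to read the inverted-U.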

The stakes for getting this wrong are higher than wasted compute. Feeding a model problems that sit *above* its current band doesn't just fail to help; it actively damages capabilities the model already had. On near-impossible problems, group-relative advantage normalization treats rare accidental successes as high-value trajectories, which reinforces degenerate shortcuts like answer repetition and skipped computation, and those shortcuts then contaminate previously sound reasoning [Do overly hard RLVR samples actually harm model capabilities?]. So the upper edge of the band isn't a soft ceiling you can safely overshoot; crossing it is corrosive. The same logic shows up in knowledge distillation: teacher-refined data that exceeds the student's current learning frontier degrades performance even when it's objectively higher quality, so students should filter refinements against their own ability rather than chase the best available signal [Does teacher-refined data always improve student model performance?].
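To see why a rare success gets outsized weight, here is a small sketch of one common form of group-relative normalization: subtract the group's mean reward and divide by its standard deviation. The corpus does not spell out the exact formula, so this form, the function name, the group size of 8, and the `eps` term are assumptions made for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (assumed form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible problem: 1 accidental success out of 8 rollouts.
rare = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
# Medium problem: 4 successes out of 8 rollouts.
balanced = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

print(round(rare[0], 2))      # about 2.65
print(round(balanced[0], 2))  # about 1.0
```

Under this assumed form, the single lucky rollout receives an advantage more than twice as large as a success on a balanced problem, regardless of whether the reasoning behind it was sound. A group where every rollout fails yields all-zero advantages, so it contributes no gradient signal at all.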

The more interesting wrinkle is that the band may not be one-dimensional. Training tends to move through phases, which means *which kind* of difficulty matters shifts too. RL training reliably runs through a first phase where execution correctness is the bottleneck and a second phase where strategic planning becomes the limiting skill, so the right kind of challenge early (get the steps right) differs from the right kind later (plan better) [Does RL training follow a predictable two-phase learning sequence?]. There's also a curriculum angle: dense, step-wise expert-similarity rewards can keep small models learning on problems that would otherwise be all-failure, effectively widening the bottom of the usable band before handing off to sparse outcome rewards once the model is strong enough to succeed on its own [Can step-wise expert rewards help small models learn hard reasoning?].

The quiet thread running under all of this: chasing the difficulty band aggressively can cost you plasticity. Models that drift far from their base distribution lose the ability to keep learning new tasks, while staying close to base preserves the room to adapt later [Does staying close to the base model preserve learning ability?]. So the real design problem isn't just tracking the band as it rises; it's tracking it without burning the model's capacity to keep climbing.


Sources (7 notes)

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
