INQUIRING LINE

How does difficulty-adaptive curriculum learning change which samples get selected during training?

This explores how matching problem difficulty to a model's current ability reshapes which training samples are actually useful — and why that target keeps moving as the model learns.


The corpus's central move is to reject difficulty as a fixed property of a problem. A sample's teaching value comes from the *interaction* between its difficulty and the model's current skill, so the useful set is a moving target rather than a fixed shortlist (see "How does model ability change what samples teach?"). That reframing is what makes a curriculum necessary in the first place: a static difficulty label goes stale within a few training steps.
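A minimal sketch of that moving target, assuming difficulty is estimated from the current model's own pass rate on a handful of rollouts. The band limits `low` and `high`, the function names, and the outcome data are all hypothetical illustrations, not values taken from the notes.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled rollouts that solved the problem."""
    return sum(outcomes) / len(outcomes)


def select_learnable(
    rollouts: Dict[str, List[bool]],
    low: float = 0.2,   # hypothetical lower edge of the productive band
    high: float = 0.8,  # hypothetical upper edge of the productive band
) -> List[str]:
    """Keep problems whose current pass rate falls inside [low, high]."""
    return [
        pid
        for pid, outcomes in rollouts.items()
        if outcomes and low <= pass_rate(outcomes) <= high
    ]


# The same three problems, measured at two points in training (made-up data).
early = {
    "easy": [True, True, False, True],       # pass rate 0.75
    "medium": [False, False, False, True],   # pass rate 0.25
    "hard": [False, False, False, False],    # pass rate 0.00
}
later = {
    "easy": [True, True, True, True],        # pass rate 1.00
    "medium": [True, False, True, True],     # pass rate 0.75
    "hard": [False, True, False, False],     # pass rate 0.25
}

print(select_learnable(early))  # ['easy', 'medium']
print(select_learnable(later))  # ['medium', 'hard']
```

Nothing about the problems changed between the two snapshots. Only the model's pass rates moved, and the selected set moved with them.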

Where does the productive zone sit? Several notes converge on an inverted-U: medium-difficulty problems give the strongest gains because they mix enough successes to reinforce with enough informative failures to learn from, while easy samples carry too little reward variance to yield a signal and very hard ones backfire (see "Why do medium-difficulty problems teach reasoning better than hard ones?"). The backfire is concrete and worth knowing. Near-impossible problems don't just waste compute; they teach degenerate shortcuts such as answer repetition and skipped computation. Group-relative reward normalization amplifies those shortcuts by treating a rare lucky success as a high-advantage trajectory, and the shortcuts then contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So adaptive selection isn't just about efficiency; it's damage control.
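The amplification effect can be checked with a few lines of arithmetic. This sketch assumes a common form of group-relative normalization, `(reward - group mean) / (group std + eps)`, with binary rewards and groups of eight rollouts; the notes do not specify the exact formula, so treat the function and the group sizes as illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group (assumed formula)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


easy = [1.0] * 8                 # every rollout succeeds
medium = [1.0] * 4 + [0.0] * 4   # half of the rollouts succeed
hard = [1.0] + [0.0] * 7         # a single lucky success

for name, group in [("easy", easy), ("medium", medium), ("hard", hard)]:
    adv = group_relative_advantages(group)
    print(f"{name:6s} max advantage = {max(adv):.2f}")
```

Under these assumptions the easy group produces zero advantage everywhere, so it teaches nothing. Each success in the medium group gets an advantage of about 1.0, while the lone success in the hard group gets about 2.65 (the square root of 7). If that success came from a shortcut, the shortcut receives the largest update of the three cases.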

The same logic shows up from the *receiving* side. When a stronger teacher refines training data, the refinements only help if they sit inside the student's learning frontier. Objectively better data degrades a weaker student that can't yet absorb it, so the student should filter refinements against its own statistical profile rather than accept everything (see "Does teacher-refined data always improve student model performance?"). That's curriculum selection wearing different clothes: keep only what's currently learnable.

Where the corpus gets genuinely interesting is that curriculum doesn't have to mean re-sorting a problem pool. Reverse-curriculum RL (R3) manufactures difficulty inside a single problem by sliding the reasoning start state backward, from near-completion toward the beginning of the problem. That turns one hard task into a graded sequence and approximates expensive step-level supervision using only outcome feedback (see "Can curriculum learning approximate expensive process supervision?"). Step-wise expert-similarity rewards, in turn, keep otherwise-too-hard samples in play by handing out dense partial credit even when every rollout fails, which is why that method works best as a curriculum *foundation* before you switch to sparse outcome rewards (see "Can step-wise expert rewards help small models learn hard reasoning?"). Both quietly change which samples are selectable by changing what counts as a usable signal.
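The sliding start state can be sketched as a generator over prefixes of an expert solution. This is an illustration of the idea only, assuming a solution is available as a list of steps; `reverse_curriculum` and the demo steps are hypothetical and do not reproduce R3's actual implementation.

```python
from typing import Iterator, List, Tuple


def reverse_curriculum(expert_steps: List[str]) -> Iterator[Tuple[List[str], int]]:
    """Yield (given_prefix, steps_left) stages, easiest stage first.

    Stage 1 hands the model all but the last expert step; the final
    stage hands it nothing, which is the original full problem.
    """
    n = len(expert_steps)
    for steps_left in range(1, n + 1):
        yield expert_steps[: n - steps_left], steps_left


demo = ["expand the square", "collect terms", "solve the quadratic", "check the roots"]
for prefix, steps_left in reverse_curriculum(demo):
    print(f"given {len(prefix)} expert steps, model must produce {steps_left}")
```

Each stage is still graded only on the final outcome. The stage at which success first drops indicates roughly which step the model cannot yet produce, which is the step-level information the paragraph above describes.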

The thread worth leaving with: adaptive curriculum reshapes selection on two axes at once. It moves the difficulty window to track rising ability, and it changes the *granularity* of the reward so that samples too hard to learn from at the outcome level become learnable at the step level. There's even a hint the model does a version of this internally: hidden states sparsify selectively as tasks get harder, acting as an adaptive filter under unfamiliar load rather than simply breaking (see "Do language models sparsify their activations under difficult tasks?").
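The granularity point can be made concrete by scoring one failed attempt two ways. In this sketch, plain string similarity from Python's `difflib` stands in for whatever expert-similarity measure the step-wise method actually uses; that substitution, the function names, and the toy algebra problem are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Sparse reward: 1.0 only if the final answer matches exactly."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


def stepwise_rewards(model_steps: List[str], expert_steps: List[str]) -> List[float]:
    """Dense reward: similarity in [0, 1] to the expert step at the same position."""
    rewards = []
    for i, step in enumerate(model_steps):
        if i < len(expert_steps):
            rewards.append(SequenceMatcher(None, step, expert_steps[i]).ratio())
        else:
            rewards.append(0.0)  # no expert step to compare against
    return rewards


expert = ["2x + 6 = 10", "2x = 4", "x = 2"]
attempt = ["2x + 6 = 10", "2x = 16", "x = 8"]  # correct setup, then a sign error

print(outcome_reward(attempt[-1], expert[-1]))
print(stepwise_rewards(attempt, expert))
```

The outcome reward for this attempt is zero, so at the outcome level the sample contributes nothing. The step-wise scores still credit the correct first step in full and the later steps in part, which is what keeps a currently unsolvable sample inside the selectable set.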


Sources: 7 notes

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield the strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter that stabilizes performance under out-of-distribution (OOD) shift rather than a failure mode.
