How does model ability change what samples teach?
Does a sample's learning value stay fixed, or does it shift as the model improves? Understanding whether informativeness is a moving target could explain why fixed difficulty filters underperform adaptive ones during training.
If medium-difficulty problems carry the strongest RLVR signal, the obvious question is: medium relative to what? The difficulty findings are stated against the model's current capability: a problem is "hard" because this model fails it now, not because of any intrinsic property. That makes informativeness a relational, moving quantity. A problem that is too hard at step zero (weak signal, shortcut amplification) can become medium-difficulty once the model improves, at which point it starts contributing the strongest gradient. Conversely, a problem that was medium-difficulty early on becomes easy and stops teaching.
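The band migration described above can be illustrated with a toy model. This is a minimal sketch, not the note's method: the logistic link between ability and pass rate, the scalar `ability` and `difficulty` values, and the band thresholds `HARD_BELOW` and `EASY_ABOVE` are all assumptions introduced for illustration.

```python
import math

# Hypothetical band thresholds on pass rate; the note does not specify any.
HARD_BELOW = 0.2
EASY_ABOVE = 0.8


def pass_rate(ability: float, difficulty: float) -> float:
    """Toy logistic model of success probability (an assumption, not from the note)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def band(ability: float, difficulty: float) -> str:
    """Classify a problem relative to the model's current ability."""
    p = pass_rate(ability, difficulty)
    if p < HARD_BELOW:
        return "hard"
    if p > EASY_ABOVE:
        return "easy"
    return "medium"


# The same problem (fixed difficulty) viewed at three stages of training.
trajectory = [band(a, difficulty=2.0) for a in (0.0, 2.0, 5.0)]
print(trajectory)  # ['hard', 'medium', 'easy']
```

The problem itself never changes; only the ability argument does, which is the sense in which difficulty is relational rather than intrinsic.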
This is the open problem that static difficulty bucketing leaves unresolved. The one-sample dynamics show that which features a sample reinforces depends on whether successful trajectories are sampled, and the chance of sampling a success on a given problem changes as the policy moves. So the curriculum cannot be set once from a fixed difficulty estimate; the productive band drifts as the policy changes during training. A sample's value is co-determined by its difficulty and the model's evolving capability, and neither factor alone predicts informativeness.
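One way to make the dependence on sampled successes concrete is the following worked calculation. It assumes a binary verifiable reward $r \in \{0, 1\}$, $G$ independent rollouts per problem, and a success probability $p_t(x)$ for problem $x$ under the policy at step $t$; none of these specifics come from the note.

```latex
\[
\operatorname{Var}[r] = p_t(x)\,\bigl(1 - p_t(x)\bigr),
\qquad
\Pr[\text{mixed outcomes in } G \text{ rollouts}]
  = 1 - p_t(x)^{G} - \bigl(1 - p_t(x)\bigr)^{G}.
\]
```

Both quantities peak at $p_t(x) = 1/2$ and vanish at $0$ and $1$. With $G = 8$, a problem at $p_t(x) = 0.05$ yields a mixed group only about 34% of the time, while one at $p_t(x) = 0.5$ does so about 99% of the time. Under a group-mean baseline (again an assumption), a group with no mixed outcomes contributes zero advantage, so the same problem goes from mostly silent to almost always informative purely because $p_t(x)$ moved.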
Why it matters: it converts a clean prescriptive result ("train on medium-difficulty samples") into a control problem ("track which samples are currently in the productive band and re-rank continuously"). It also explains why fixed difficulty filters underperform adaptive schemes: the filter is correct only for the capability snapshot at which it was computed. The unresolved part is how cheaply current informativeness can be estimated online. Re-estimating per-sample difficulty every few steps is expensive, and cheaper proxies (recent pass rate, reward variance) are noisy. The question is worth tracking because adaptive-curriculum RLVR depends on solving it efficiently.
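A cheap online proxy of the kind mentioned above might look like the following. This is a hedged sketch only: the class `InformativenessTracker`, its `decay` and `prior` parameters, and the choice of $p(1-p)$ as the score are hypothetical, and the note does not endorse any particular estimator.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Sequence


@dataclass
class InformativenessTracker:
    """Keeps a smoothed pass rate per sample and ranks samples by p * (1 - p)."""

    decay: float = 0.9   # hypothetical smoothing factor for the running pass rate
    prior: float = 0.5   # hypothetical pass-rate estimate for samples not yet seen
    rates: Dict[str, float] = field(default_factory=dict)

    def update(self, sample_id: str, rewards: Sequence[float]) -> None:
        """Blend the mean reward of the latest rollouts into the running estimate."""
        if not rewards:
            return
        batch_rate = sum(rewards) / len(rewards)
        old = self.rates.get(sample_id, self.prior)
        self.rates[sample_id] = self.decay * old + (1.0 - self.decay) * batch_rate

    def score(self, sample_id: str) -> float:
        """Bernoulli variance of the estimated pass rate; highest near 0.5."""
        p = self.rates.get(sample_id, self.prior)
        return p * (1.0 - p)

    def top_k(self, sample_ids: Iterable[str], k: int) -> List[str]:
        """Return the k samples currently estimated to be most informative."""
        return sorted(sample_ids, key=self.score, reverse=True)[:k]


tracker = InformativenessTracker(decay=0.5)
tracker.update("a", [1, 1, 1, 1])  # always solved: drifting toward easy
tracker.update("b", [0, 1, 0, 1])  # mixed outcomes: in the productive band
tracker.update("c", [0, 0, 0, 0])  # never solved: drifting toward too hard
print(tracker.top_k(["a", "b", "c"], k=1))  # ['b']
```

The estimator reuses rewards that training rollouts already produce, so it adds no extra sampling cost, but it inherits the noise the note warns about: a few rollouts per update give a coarse pass rate, and the smoothing lags behind a fast-moving policy. A prior of 0.5 also makes unseen samples look maximally informative, which is an optimistic default rather than a principled one.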
Inquiring lines that use this note as a source 15
This note is a source for these synthesized inquiries.
- How should guidance levels adapt as the model's capability boundary shifts?
- How does modified PPO handle samples from much older model versions?
- How does training data distribution determine what models can learn?
- Why do medium-difficulty problems produce more stable learning gains?
- How do difficulty metrics relate to the true value of training examples?
- Why does test accuracy improve after training accuracy reaches 100 percent?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- Does importance sampling actually recover capabilities lost to hard sample training?
- Why do adaptive curriculum schemes outperform static difficulty filters?
- Can we cheaply estimate which samples are currently most informative?
- Does the productive difficulty band ever stabilize during training?
- How does difficulty-adaptive curriculum learning change which samples get selected during training?
- What features does a sample reinforce when it moves bands?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- What mechanisms cause overly hard samples to degrade prior model performance?
Related concepts in this collection 6
- Why do medium-difficulty problems teach reasoning better than hard ones?
Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
the static finding this note dynamizes: the inverted-U is correct only relative to a fixed capability snapshot
- Can a single training example unlock mathematical reasoning?
Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
evidence that capability keeps moving even after apparent saturation, so the difficulty-capability relation never settles
- Does RLVR actually expand what models can reason about?
Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
bounds the dynamic: if RLVR only reshapes sampling within fixed boundaries, the productive band drifts but cannot move past the base model's frontier
- Can we prune training data without hurting model performance?
This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
static-pruning counterpart; the dynamic-informativeness view argues the pruning criterion must be recomputed as capability evolves
- What reasoning features does each difficulty level reinforce?
When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.
grounds: gives the feature-level content of "informativeness" — what a sample teaches, and thus its value, shifts with the band it currently occupies
- Should training maximize diversity when models feed into search?
If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
extends: a related control-problem reframe of RL training objectives, where what to optimize for changes with deployment regime rather than being fixed once
Original note title
sample informativeness is dynamic depending on the interaction between task difficulty and the model's evolving capability