How should guidance levels adapt as the model's capability boundary shifts?
This reads the question as: how much scaffolding (solution traces, hints, reward signals) should a model get during training, given that what counts as 'too hard' or 'too easy' keeps moving as the model improves.
The *amount* of help a model needs is a moving target, tied not to a problem's fixed difficulty but to the gap between that difficulty and the model's current ability. The corpus's sharpest claim here is that a sample's teaching value is dynamic: a problem that is out of reach early in training becomes a useful stretch and then trivial as the model improves, and that shift can happen within a few training steps, so any static difficulty label goes stale fast (How does model ability change what samples teach?). The practical consequence is that guidance has to track a drifting 'productive band' of medium-hard problems rather than being set once.
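One way to make the 'productive band' concrete is sketched below. It rests on an assumption that does not come from the corpus: each rollout is treated as a pass/fail trial, and the reward variance p(1 - p) at pass rate p stands in for teaching value. The function names and the band edges (0.2 and 0.8) are hypothetical.

```python
def pass_rate(rewards):
    """Fraction of rollouts that earned a positive reward."""
    return sum(1 for r in rewards if r > 0) / len(rewards)


def learning_signal(p):
    """Variance of a pass/fail reward: zero at p=0 and p=1, largest at p=0.5.

    Assumption: used here only as a stand-in for 'teaching value'.
    """
    return p * (1.0 - p)


def in_productive_band(p, low=0.2, high=0.8):
    """Hypothetical band edges; they would need tuning per setup."""
    return low <= p <= high


# The same problem measured at three points in training.
early = [0, 0, 0, 0, 0, 0, 0, 0]  # too hard: every rollout fails
mid = [1, 0, 1, 0, 0, 1, 0, 1]    # productive stretch: mixed outcomes
late = [1, 1, 1, 1, 1, 1, 1, 1]   # mastered: every rollout passes

for name, rewards in [("early", early), ("mid", mid), ("late", late)]:
    p = pass_rate(rewards)
    print(name, p, learning_signal(p), in_productive_band(p))
```

Because the pass rate is re-measured from recent rollouts rather than stored as a label, the band membership of a problem updates as the model changes, which is the point the paragraph above makes.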
The most direct answer to the question is to make guidance conditional on where each problem sits relative to that boundary. One approach (GHPO) hands over ground-truth solution traces only on problems the model currently can't crack on its own, while letting it learn unaided on manageable ones. This converts compute that would otherwise be wasted on impossible problems (zero reward signal) into usable learning, for a reported gain of about 5% across math benchmarks (Can adaptive guidance from solution traces reduce reward sparsity in RL?). As the boundary shifts outward, fewer problems should trigger the trace handout; the scaffolding withdraws as capability catches up. This is guidance as a thermostat, not a fixed dose.
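The conditional handout described above can be sketched roughly as follows. This is a minimal illustration, not the GHPO implementation: the all-rollouts-failed trigger, the `hint_fraction` parameter, the prompt format, and every name in the block are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    solution_trace: str  # ground-truth trace already present in the training data


def build_prompt(problem, rewards, hint_fraction=0.5):
    """Return the prompt for the next round of rollouts.

    rewards: rewards from the model's most recent unaided rollouts.
    If any rollout succeeded, the problem is inside the boundary: no guidance.
    If all failed, prepend the first `hint_fraction` of the solution trace.
    """
    if any(r > 0 for r in rewards):
        return problem.prompt
    cut = int(len(problem.solution_trace) * hint_fraction)
    hint = problem.solution_trace[:cut]
    return f"{problem.prompt}\nPartial solution: {hint}"


def guided_share(batch_rewards):
    """Fraction of problems in a batch that currently trigger guidance."""
    guided = sum(1 for rewards in batch_rewards if not any(r > 0 for r in rewards))
    return guided / len(batch_rewards)


problem = Problem(prompt="Q", solution_trace="abcdefghij")
print(build_prompt(problem, [0, 0, 0]))  # all rollouts failed: hint is attached
print(build_prompt(problem, [0, 1, 0]))  # one rollout passed: prompt unchanged
print(guided_share([[0, 0], [1, 0], [0, 0], [1, 1]]))
```

Tracking `guided_share` over training gives a direct read on the thermostat behavior: if capability is catching up, the share of problems that trigger a handout should fall.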
There's a subtler reframing in the corpus worth knowing: a lot of what RL 'guidance' does isn't installing new capability at all, but teaching the model *when* to deploy reasoning it already latently has. Hybrid models recover 91% of the gains by routing tokens alone (Does RL post-training create reasoning or just deploy it?). If the boundary you're pushing is a deployment boundary rather than a capability one, the guidance that matters is about timing and not switching too early: a decoding-time penalty on premature thought transitions improves accuracy with no retraining at all (Do reasoning models switch between ideas too frequently?). So 'guidance level' isn't one knob; it's different knobs depending on which boundary is actually moving.
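The decoding-time penalty on premature thought transitions mentioned above can be sketched as a logit adjustment. This is an illustrative sketch under stated assumptions, not the published TIP code: the function name, the `penalty` and `window` values, and the choice of switch tokens are all placeholders.

```python
def penalize_switch_tokens(logits, switch_token_ids, tokens_since_switch,
                           penalty=3.0, window=600):
    """Return a copy of `logits` with thought-switch tokens made less likely.

    logits: list of floats, one per vocabulary id, for the next token.
    switch_token_ids: ids of tokens that open a new line of thought
        (for example the id of "Alternatively"); hypothetical here.
    tokens_since_switch: tokens generated since the current thought began.
    The penalty applies only while the current thought is younger than `window`.
    """
    adjusted = list(logits)
    if tokens_since_switch < window:
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted


logits = [1.0, 2.0, 3.0]
print(penalize_switch_tokens(logits, {1}, tokens_since_switch=10))
print(penalize_switch_tokens(logits, {1}, tokens_since_switch=700))
```

The adjustment touches only the next-token scores at decode time, so the model weights stay unchanged, which matches the 'no retraining' property the paragraph describes.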
The boundary itself also moves in a predictable shape, which tells you *what kind* of guidance to front-load. RL training tends to run in two phases: first execution (procedural correctness) is the bottleneck, then strategic planning takes over. Guidance weighted toward execution early and toward planning later therefore matches where the learning signal actually concentrates (Does RL training follow a predictable two-phase learning sequence?). And not every skill rides the same curve: reasoning and knowledge skills keep improving with scale, while metacognition saturates early (around 7B parameters) and surface style is easy to imitate. The capability boundary thus advances at different rates for different competencies, and uniform guidance over-serves the skills that have already plateaued (Do all AI skills improve equally as models scale?).
The thing you didn't know you wanted to know: there's a hard ceiling on how far self-supplied guidance can take you. Pure self-improvement stalls on the generation-verification gap and quietly collapses diversity or hacks its own reward. The methods that actually keep working smuggle in an *external* anchor: a past checkpoint, a third-party judge, a tool result, or a user correction (Can models reliably improve themselves without external feedback?). So the right answer to 'how should guidance adapt' includes a constraint: as you withdraw scaffolding near the frontier, you can't replace it with the model grading itself. The guidance that's left has to stay externally grounded, or the boundary stops moving.
Sources (7 notes)
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
GHPO dynamically provides ground-truth solution traces for hard problems while using standard RL for manageable ones, achieving 5% gains across math benchmarks. This converts wasted compute on impossible problems into learning signal by leveraging traces already present in training data.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.