How does baseline capability level affect RL improvement ceiling?

This explores whether a model's starting (base) capability sets a hard ceiling on how much reinforcement learning can improve it — and what the corpus reveals about why some models have more headroom than others.

The corpus's most striking answer is that for many tasks RL does not raise the ceiling at all; it helps the model reach what it already had. Several notes converge on the idea that base models already contain reasoning capability in latent form, and that RL mostly teaches *when* to deploy it rather than *how* to do something new (see "Does RL post-training create reasoning or just deploy it?"). One note supports this mechanistically: RL's gains appear to come largely from suppressing wrong trajectories rather than amplifying correct ones, with updates concentrated in a sparse subnetwork of parameters (see "What actually changes inside a model during RL training?"). If RL is unlocking suppressed ability, then the base model's latent repertoire is the ceiling.

That is not the whole story, though: whether the ceiling is fixed depends on the *kind* of task. For standard reasoning, RL activates what is already there, so a weak base model stays roughly capped by its latent abilities. For complex multi-step planning, RL has been shown to generate genuinely novel strategies that the base model could not reach even with extensive sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). So 'baseline determines the ceiling' holds in some domains and fails in others.
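
One common way to make "could not reach even with extensive sampling" measurable is the pass@k metric: the probability that at least one of k samples solves a problem. The notes do not say which metric the cited work used, so the sketch below is an assumption about how such a comparison could be set up. It uses the standard unbiased pass@k estimator, and the sample counts are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n drawn samples, c of which were correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct sample.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of drawn samples n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical base-model results on two problems, 1000 samples each.
latent = pass_at_k(n=1000, c=5, k=256)   # rare successes: ability is present but buried
absent = pass_at_k(n=1000, c=0, k=256)   # no successes: sampling alone never finds it

print(f"latent ability   pass@256 = {latent:.3f}")
print(f"absent ability   pass@256 = {absent:.3f}")
```

Under this reading, a problem where the base model's pass@k climbs toward 1 as k grows is a candidate for activation, while one that stays at zero for large k is where any RL gain would have to come from a genuinely new strategy.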

The ceiling also depends heavily on the training *recipe* and *reward signal*, not just on the base model. A large-scale scaling study (200+ models, 400K GPU-hours) found that RL performance climbs along predictable sigmoid curves in which the recipe sets the asymptote and implementation details only affect how fast you get there (see "Does RL training follow predictable scaling curves?"). Separately, RL produces dramatic gains (one task went from 0.15% to 73.98%) when rewards are binary and verifiable, but only modest movement when the signal is a fuzzy judgment (see "Why does RL succeed more on some tasks than others?"). So two models with identical baselines can hit very different ceilings depending on how cleanly their task can be scored.
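
The distinction between asymptote and speed can be sketched with a generic saturating curve. The functional form and every parameter value below are assumptions chosen for illustration, not the fitted form or numbers from the study: `asymptote` stands in for what the recipe determines, and `midpoint` (the compute at which half the gain is realised) stands in for implementation efficiency.

```python
def sigmoid_performance(compute: float, floor: float, asymptote: float,
                        midpoint: float, steepness: float) -> float:
    """Saturating curve in compute that rises from `floor` toward `asymptote`."""
    return floor + (asymptote - floor) / (1.0 + (midpoint / compute) ** steepness)


# Hypothetical settings sharing the same starting performance.
baseline_recipe = dict(floor=0.20, asymptote=0.60, midpoint=1_000.0, steepness=1.5)
faster_impl = dict(baseline_recipe, midpoint=250.0)     # same ceiling, reached sooner
better_recipe = dict(baseline_recipe, asymptote=0.80)   # higher ceiling

for compute in (100.0, 1_000.0, 10_000.0, 1_000_000.0):
    row = [sigmoid_performance(compute, **cfg)
           for cfg in (baseline_recipe, faster_impl, better_recipe)]
    print(f"compute={compute:>11,.0f}  " + "  ".join(f"{v:.3f}" for v in row))
```

In this toy setup the faster implementation leads at low compute but converges to the same value as the baseline recipe, while the better recipe ends up higher no matter how long the others train.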

There is a counterintuitive twist on the instinct that harder problems push the ceiling higher: training on problems that are too hard for the current model can actively *lower* the ceiling. Near-impossible samples teach degenerate shortcuts, such as answer repetition and skipped computation, that then contaminate abilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). This implies the productive zone for RL sits just above current capability, not far beyond it. Careful curation of the training signal points the same way: a 14B model trained with quality-filtered trajectories reached frontier math performance that its raw baseline would not predict (see "Why do correct code trajectories teach models to tolerate errors?").
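
The source note attributes this harm to group-relative normalization, which scores each sampled answer against the other answers to the same prompt. A minimal sketch of that arithmetic follows, assuming binary rewards, a group of 16 samples, and normalization by the population standard deviation; the function name and the numbers are illustrative, not taken from any specific implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's sample group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 sampled answers with binary rewards.
moderate = [1.0] * 8 + [0.0] * 8        # problem near the model's current level
near_impossible = [1.0] + [0.0] * 15    # a single, possibly accidental, success

adv_moderate = group_relative_advantages(moderate)
adv_hard = group_relative_advantages(near_impossible)

print(f"advantage of a success, moderate problem:        {adv_moderate[0]:.2f}")
print(f"advantage of a success, near-impossible problem: {adv_hard[0]:.2f}")
```

The lone success on the near-impossible problem receives an advantage several times larger than a success on the moderate problem. The normalization cannot tell whether that success came from sound reasoning or from a shortcut, so whatever produced it is reinforced strongly.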

The deepest limit, though, is not about capability at all; it is about feedback. Pure self-improvement, where a model tries to bootstrap past its own baseline with no external signal, runs into three structural obstacles: the generation-verification gap, diversity collapse, and reward hacking. Every method that reliably improves smuggles in an outside anchor, such as a judge, a past version, a tool, or a user correction (see "Can models reliably improve themselves without external feedback?"). So the real 'ceiling' on RL improvement is set jointly by what the base model latently knows, how cleanly the task can be scored, and whether there is a genuine external signal to climb toward.

Sources (8 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does RL training follow predictable scaling curves?

Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and to surpass much larger models while producing cleaner reasoning.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
