Why does prolonged RL discover strategies absent from any base model sample?

This explores a genuine fight in the corpus: whether reinforcement learning can invent reasoning the base model never had, or whether it only sharpens sampling of strategies already latent inside — and what conditions tip it one way or the other.

This is a live dispute, not a settled finding. One camp says RL invents nothing: it just gets better at fishing solutions out of a distribution the base model already contains. The sharpest version of that claim comes from pass@k analysis: at high k, base models actually *beat* their RL-trained versions, which means RL narrowed the search toward known answers rather than widening what's solvable at all (Does RLVR actually expand what models can reason about?). The same picture appears in the finding that a single training example, or even a *spurious* reward, can trigger most of the gain, which is a signature of activation, not teaching (What does reward learning actually do to model reasoning?). A cleaner framing still: RL teaches a model *when* to reason, not *how*. Hybrid models recover 91% of the gains just by routing tokens, and the activation vectors for reasoning strategies exist before any RL touches the weights (Does RL post-training create reasoning or just deploy it?).
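To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. The per-problem counts below are invented for illustration and do not come from any of the cited notes. They are chosen so that a hypothetical RL model is far more reliable on two problems while a hypothetical base model solves all four at least occasionally.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100  # samples drawn per problem (assumed)

# Hypothetical numbers of correct samples out of N on four problems.
base_counts = [10, 5, 2, 1]   # rarely right, but never at zero
rl_counts = [80, 60, 0, 0]    # very reliable on two, never solves the others

for k in (1, 10, 100):
    base_score = mean_pass_at_k(base_counts, N, k)
    rl_score = mean_pass_at_k(rl_counts, N, k)
    print(f"k={k:3d}  base={base_score:.3f}  rl={rl_score:.3f}")
```

With these made-up counts the RL model leads at k=1 and the base model leads at k=100, which is the crossover pattern the narrowing argument relies on. A model that had truly expanded its boundary would instead lead at every k.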

So why does the opposite result, strategies absent from *any* base sample, keep showing up? The reconciling note is that capability creation is **domain-conditional** (Does reinforcement learning create new reasoning abilities or activate existing ones?). On standard reasoning, where the base model has seen the patterns, RL only activates what's latent. But on complex multi-step planning, where no established pattern exists to sample, RL generates genuinely novel strategies the base model can't reach even with extensive sampling. The 'prolonged RL' result lands here: trained long enough, on *diverse and non-mathematical* tasks, with KL control and policy resetting, RL-trained models win across *all* pass@k levels, which is the signature of an expanded boundary rather than a narrowed one (Can reinforcement learning discover reasoning strategies base models cannot?). The disagreement between the two camps is largely a disagreement about which domains they tested.
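The notes do not spell out how KL control and policy resetting are implemented, so the following is only a sketch under stated assumptions: KL control is read as subtracting a KL penalty (weighted by an assumed coefficient `BETA`) from the task reward, and policy resetting is read as periodically replacing the reference distribution with a snapshot of the current policy. The three-way distributions are hypothetical.

```python
from math import log

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(task_reward, policy, reference, beta):
    """Task reward minus a penalty for drifting away from the reference."""
    return task_reward - beta * kl_divergence(policy, reference)

# Hypothetical distributions over three candidate actions.
reference = [0.5, 0.3, 0.2]   # stands in for the base model
policy = [0.2, 0.3, 0.5]      # stands in for the policy after some training

BETA = 0.1  # assumed penalty weight, not a value from the source

before = penalized_reward(1.0, policy, reference, BETA)

# Assumed meaning of "policy resetting": snapshot the policy as the new reference.
reference = list(policy)
after = penalized_reward(1.0, policy, reference, BETA)

print(f"before reset: {before:.4f}")
print(f"after reset:  {after:.4f}")
```

Under this reading, the penalty keeps each stretch of training close to its anchor, while the reset lets the anchor itself move, so a long run can travel far in total without any single phase drifting hard.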

What's quietly fascinating is *why* the length of training matters, and a two-phase dynamic explains it. Early in training, RL is busy consolidating procedural execution: getting steps correct. Only in a second phase does strategic planning become the bottleneck, with planning-token entropy *rising* while execution entropy stabilizes (Does RL training follow a predictable two-phase learning sequence?). Novel strategy, in other words, is a late-training phenomenon: you can't reach the exploration phase without paying for the consolidation phase first. Short runs never get there, which is part of why the 'RL discovers nothing' studies and the 'RL discovers new strategies' studies disagree: they may be sampling different points on the same curve.
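One way to picture the entropy signal is to average next-token entropy separately for planning and execution tokens at two checkpoints. The sketch below assumes each token has already been labeled with a role. How the underlying work assigns those labels is not described in the notes, and the distributions are hypothetical.

```python
from math import log

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

def mean_entropy_by_role(steps):
    """Average entropy per role; steps is a list of (role, probs) pairs."""
    totals, counts = {}, {}
    for role, probs in steps:
        totals[role] = totals.get(role, 0.0) + entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}

# Hypothetical checkpoints: the role labels and probabilities are made up.
early_checkpoint = [
    ("planning", [0.90, 0.05, 0.05]),   # planning choices still near-fixed
    ("execution", [0.60, 0.20, 0.20]),  # execution still being consolidated
]
late_checkpoint = [
    ("planning", [0.40, 0.30, 0.30]),   # planning now being explored
    ("execution", [0.90, 0.05, 0.05]),  # execution has settled
]

early = mean_entropy_by_role(early_checkpoint)
late = mean_entropy_by_role(late_checkpoint)
print("early:", early)
print("late: ", late)
```

In this toy data, planning entropy rises between checkpoints while execution entropy falls and settles, which is the shape the two-phase account predicts for a run long enough to reach the second phase.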

There's a structural reason this is even possible without scrambling the model. RL updates only 5–30% of parameters, and those sparse updates are nearly full-rank and nearly identical across random seeds, meaning the model is making a *structured*, targeted edit, not a diffuse one (Does reinforcement learning update only a small fraction of parameters?). Staying close to the base distribution turns out to be load-bearing: low KL drift preserves the plasticity needed to keep learning, while parameter-only methods that drift hard simply stall when the domain shifts (Does staying close to the base model preserve learning ability?). So 'prolonged' discovery isn't brute-force divergence from the base model. It's a long, narrow, stable walk that keeps the base intact while carving new planning behavior on top.
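The sparsity claim can be checked in principle by comparing a base checkpoint with its RL-tuned counterpart. The sketch below assumes flat lists of weights, a hypothetical tolerance `tol`, and reads "nearly identical across random seeds" as high overlap between the sets of changed positions. All three are assumptions for illustration, and the ten-element weight vectors are invented.

```python
def changed_indices(base, tuned, tol=0.0):
    """Positions whose value moved by more than tol."""
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}

def update_fraction(base, tuned, tol=0.0):
    """Share of parameters that changed during training."""
    return len(changed_indices(base, tuned, tol)) / len(base)

def jaccard(a, b):
    """Overlap of two index sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical ten-parameter "models".
base   = [0.10, -0.20, 0.30, 0.00, 0.50, -0.40, 0.25, 0.15, -0.05, 0.60]
seed_a = [0.10, -0.20, 0.35, 0.00, 0.50, -0.40, 0.25, 0.10, -0.05, 0.60]
seed_b = [0.10, -0.20, 0.33, 0.00, 0.50, -0.40, 0.25, 0.12, -0.02, 0.60]

print(update_fraction(base, seed_a))  # 2 of 10 positions changed
print(update_fraction(base, seed_b))  # 3 of 10 positions changed
print(jaccard(changed_indices(base, seed_a), changed_indices(base, seed_b)))
```

A low update fraction together with a high overlap across seeds is what would distinguish a structured edit from scattered noise. This toy check does not measure rank, which would need the actual weight matrices.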

The catch worth knowing: this discovery is fragile and cuts against diversity. The same RL that finds new planning strategies also *compresses* behavioral diversity through entropy collapse. Policies converge on narrow reward-maximizing paths, the same way they do in search agents, where SFT on diverse demonstrations is what preserves exploration breadth (Does reinforcement learning squeeze exploration diversity in search agents?). Push it with problems that are too hard and it doesn't discover anything at all: it learns degenerate shortcuts that then contaminate abilities the model already had (Do overly hard RLVR samples actually harm model capabilities?). So the honest answer to the question is: prolonged RL discovers absent strategies only in a narrow regime, namely hard-but-tractable planning domains, trained long enough to reach the exploration phase, with the base model held close enough to stay plastic. Outside that regime, what looks like discovery is either activation of latent ability or active damage.

Sources (10 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
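The effect of group-relative normalization on a lone success can be seen with a few lines of arithmetic. This sketch assumes binary rewards, groups of 16 rollouts, and normalization by the group's mean and population standard deviation. Actual implementations may differ in the standard-deviation variant and the stabilizing constant `eps`.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 16 rollouts with binary rewards.
tractable = [1.0] * 8 + [0.0] * 8       # half of the rollouts succeed
near_impossible = [1.0] + [0.0] * 15    # a single accidental success

adv_tractable = group_relative_advantages(tractable)
adv_hard = group_relative_advantages(near_impossible)

print(f"advantage of a success, tractable group:       {max(adv_tractable):.2f}")
print(f"advantage of a success, near-impossible group: {max(adv_hard):.2f}")
```

Under these assumptions the single lucky rollout in the near-impossible group gets an advantage close to four times that of a success in the tractable group, so whatever that rollout happened to do, sound or not, is reinforced hardest.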

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether prolonged RL in LLMs genuinely discovers novel reasoning strategies absent from base-model samples, or merely activates latent ones. The question remains open.

What a curated library found — and when (findings span 2025–2026; treat as dated claims, not current truth):
• Pass@k analysis suggests RL narrows rather than expands the search space; base models outperform RL versions at high k, indicating activation not invention (arXiv:2504.13837, 2025-04).
• A two-phase dynamic exists: early RL consolidates procedural execution; only in late training does planning become the bottleneck, with planning-token entropy rising while execution stabilizes (inferred from 2025–2026 work).
• Prolonged RL discovers genuinely novel strategies *only* on non-mathematical, multi-step planning tasks where the base model has no established pattern; on standard reasoning, it activates latent capabilities (arXiv:2505.24864, 2025-05).
• RL updates only 5–30% of parameters in sparse, full-rank subnetworks, making structured, seed-stable edits; low KL drift from the base preserves plasticity (arXiv:2505.11711, 2025-05).
• RL training compresses behavioral diversity through entropy collapse and is fragile to overly hard samples, which induce degenerate shortcuts (inferred from 2026 mechanistic work).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-04): pass@k refutation of discovery claim.
• arXiv:2505.24864 (2025-05): prolonged RL as domain-conditional expansion.
• arXiv:2505.11711 (2025-05): sparse, full-rank parameter updates.
• arXiv:2605.28388 (2026-05): mechanistic role of sample difficulty.

Your task:
(1) RE-TEST the three competing claims—RL as pure activation; RL as genuine discovery on complex domains; RL as fragile to regime shift—against any models, training methods, or evaluations released in the last 6 months. Which constraint has been relaxed (e.g., by scaling, curricula, or hybrid mid-training)? Which still holds? Separate the durable question (is the discovery domain-conditional?) from perishable limitations (does prolonged RL always require low KL drift?).
(2) Surface the strongest *contradicting* or *superseding* work from late 2026 onward. Does anything overturn the two-phase dynamic or the diversity-compression trade-off?
(3) Propose two research questions that assume the regime may have shifted: e.g., Does instruction-tuned mid-training obviate the need for prolonged RL to reach the exploration phase? Can diversity preservation during late-phase planning unlock strategies on harder domains?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
