Can reinforcement learning discover reasoning strategies base models cannot?
Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
A fundamental debate in RL for reasoning: does RL truly expand capabilities, or does it merely optimize sampling efficiency over solutions already embedded in the base model? Several studies argued for the latter. As covered in the note "Does RLVR actually expand what models can reason about?", pass@k analysis showed base-model performance eventually surpassing that of RL-trained models as k increases. ProRL directly challenges this conclusion.
The challenge is methodological, not philosophical. ProRL identifies two limitations in prior studies: (1) overreliance on mathematics, where models are already overtrained during pre-training and post-training, restricting exploration potential; and (2) premature termination of RL training before models can fully explore novel reasoning capabilities. The solution: KL divergence control to prevent collapse, reference policy resetting to maintain exploration, and a diverse suite of tasks beyond math.
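To make the two training-side ingredients concrete, here is a minimal sketch of a KL-penalized objective with periodic reference resetting on a toy discrete policy. It is an illustration under stated assumptions, not ProRL's actual loss or schedule: the function names, the penalty weight `beta`, and the fixed `reset_every` interval are all hypothetical, and the note does not specify how ProRL decides when to reset.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_objective(expected_reward, policy, reference, beta):
    """Expected reward minus a KL penalty that keeps the policy near the reference.

    `beta` is a hypothetical penalty weight, not a value taken from ProRL.
    """
    return expected_reward - beta * kl_divergence(policy, reference)

def maybe_reset_reference(step, policy, reference, reset_every):
    """Replace the reference with a copy of the current policy every `reset_every` steps.

    The fixed interval is an assumption made for illustration only.
    """
    if step > 0 and step % reset_every == 0:
        return list(policy)
    return reference

# Toy usage: the policy has drifted away from the original reference.
reference = [0.5, 0.5]
policy = [0.9, 0.1]
before = penalized_objective(1.0, policy, reference, beta=0.1)
reference = maybe_reset_reference(100, policy, reference, reset_every=100)
after = penalized_objective(1.0, policy, reference, beta=0.1)
print(before < after)  # the penalty is zero again right after the reset
```

The design intuition the sketch tries to capture: the KL term discourages the policy from drifting far from the reference within one phase of training, while resetting the reference moves the anchor so that the penalty does not permanently pin the policy near its starting point.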
The result is striking. RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. This is the critical distinction: the gain is not just better sampling efficiency but access to solution strategies that the base model fails to produce at any k.
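The difference between "rarely solves" and "never solves" is easy to see with the commonly used unbiased pass@k estimator, which takes n samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The sketch below is illustrative only: the sample counts are made-up numbers, not results from ProRL or any other study.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

n = 256  # hypothetical number of samples drawn per problem

# A problem the model solves rarely (2 of 256 samples correct):
# pass@k climbs toward 1 as k grows, so large k hides the low success rate.
rare = [pass_at_k(n, 2, k) for k in (1, 16, 256)]

# A problem the model never solves (0 of 256 samples correct):
# pass@k stays at 0 for every k, which is the case the note emphasizes.
never = [pass_at_k(n, 0, k) for k in (1, 16, 256)]

print(rare)
print(never)
```

In the first case a larger sampling budget eventually recovers the solution, which is the sampling-efficiency reading of RL gains. In the second case no budget within the n samples helps, so an RL-trained model that solves such a problem is doing something the sampled base model did not.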
However, this finding is in tension with the existing note "Does RL teach reasoning or just when to use it?", which holds that RL mainly teaches models when to activate capabilities they already have. The resolution may be domain-conditional: on overtrained domains (mathematics, coding), where base models have been extensively exposed during pre-training, RL primarily teaches timing and selection. On genuinely novel reasoning tasks, where base models lack established solution patterns, sufficiently prolonged RL can expand the capability frontier.
This has practical implications for how long to run RL training and on what tasks. If the goal is genuinely new reasoning capabilities rather than just better deployment of existing ones, RL must be applied to diverse, non-overtrained domains with sufficient training duration.
Inquiring lines that use this note as a source 28
This note is a source for these synthesized inquiries.
- Can RLVR expand a model's reasoning capabilities beyond its training ceiling?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- Can RL teach when to use reasoning versus when to respond directly?
- How does RL refine reasoning paths without simply adding model capability?
- Does RL refine existing knowledge or discover entirely new capabilities?
- What limits RL's ability to scale for reasoning at training time?
- Can RL training teach models when to activate reasoning versus when to skip it?
- Does negative reinforcement alone achieve what full RL training accomplishes?
- Can extended RL training unlock genuinely new reasoning strategies models cannot discover otherwise?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- Does RL training actually restore the critical thinking that reasoning models lose?
- Does RLVR expand model capability or reorganize existing capability?
- Does RL teach models when to use reasoning or how to reason?
- How do RL training and base models differ in creating MI peaks?
- Why does prolonged RL discover strategies absent from any base model sample?
- Can one training example activate mathematical reasoning in RL-trained models?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- Can one training example activate mathematical reasoning without reinforcement learning?
- Does RL training activate latent meta-learning capacity or create it from scratch?
- What training duration is actually needed for RL to expand capabilities?
- Does RL primarily teach when to use reasoning or how to reason?
- Can the exploration ceiling be raised beyond what pretraining established?
- What does RL post-training actually teach reasoning systems?
- When does RL discover genuinely novel reasoning strategies versus timing optimization?
- How does pretraining determine what RL can later teach a model?
- Can RL create new reasoning primitives that pretraining never established?
- How do extrapolative and contextual generalization measure RL reasoning gains?
Related concepts in this collection 5
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  tension: ProRL challenges this claim on novel (non-math) tasks
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  partially challenged: true for overtrained domains, not for genuinely novel tasks
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  supports: prolonged training is the condition under which emergence happens
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
  partially challenged: ProRL shows the ceiling can be raised with sufficient training duration and diversity
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  direct tension: pass@k analysis shows RLVR narrows boundaries, but ProRL with sufficient duration and diversity on non-overtrained domains expands them; the resolution is domain-conditional
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Eliciting Reasoning in Language Models with Cognitive Tools
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
Original note title
prolonged rl discovers genuinely novel reasoning strategies inaccessible to base models even under extensive sampling