SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can reinforcement learning discover reasoning strategies base models cannot?

Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

A fundamental debate in RL for reasoning: does RL truly expand capabilities, or does it merely optimize sampling efficiency over solutions already embedded in the base model? Several studies argued for the latter: as covered in "Does RLVR actually expand what models can reason about?", pass@k analysis showed base model performance eventually surpassing RL-trained models as k increases. ProRL directly challenges this conclusion.

The challenge is methodological, not philosophical. ProRL identifies two limitations in prior studies: (1) overreliance on mathematics, where models are already overtrained during pre-training and post-training, restricting exploration potential; and (2) premature termination of RL training before models can fully explore novel reasoning capabilities. The solution: KL divergence control to prevent collapse, reference policy resetting to maintain exploration, and a diverse suite of tasks beyond math.
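To make the two stabilizing ingredients concrete, here is a minimal sketch of a KL-penalized objective and a periodic reference reset. It uses toy discrete distributions in place of a language-model policy. The names `beta` and `reset_every`, and the fixed-interval reset rule, are illustrative assumptions, not ProRL's actual implementation or hyperparameters.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_objective(expected_reward, policy, reference, beta):
    """Expected reward minus a KL penalty that keeps the policy near the reference.

    beta is a hypothetical penalty weight chosen for illustration.
    """
    return expected_reward - beta * kl_divergence(policy, reference)

def maybe_reset_reference(step, policy, reference, reset_every):
    """Replace the reference with a snapshot of the current policy every
    reset_every steps (an assumed schedule); otherwise keep it unchanged."""
    if step > 0 and step % reset_every == 0:
        return list(policy)
    return reference

policy = [0.9, 0.1]
reference = [0.5, 0.5]
# Before a reset, the drifted policy pays a KL penalty.
before = penalized_objective(1.0, policy, reference, beta=0.1)
# After a reset, the reference matches the policy and the penalty vanishes.
reference = maybe_reset_reference(10, policy, reference, reset_every=10)
after = penalized_objective(1.0, policy, reference, beta=0.1)
```

The point of the reset in this sketch is that the penalty anchors the policy to a moving snapshot rather than to the original starting point, so the policy can keep drifting over a long run while each individual update stays constrained.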

The result is striking. RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. This is the critical distinction: the gain is not just better sampling efficiency, but access to solution strategies that the base model does not produce at any k tested.
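For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below assumes that estimator. The note does not state which estimator ProRL used, and the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved once in ten samples: pass@k rises with k.
rare = [pass_at_k(10, 1, k) for k in (1, 5, 10)]
# A problem never solved: pass@k stays at zero however large k gets.
never = [pass_at_k(100, 0, k) for k in (1, 10, 100)]
```

The second case is the one that matters for the argument: when no sample is ever correct, increasing k cannot help, so larger sampling budgets alone do not close the gap on such problems.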

However, this finding is in tension with the existing insight in "Does RL teach reasoning or just when to use it?". The resolution may be domain-conditional: on overtrained domains (mathematics, coding), where base models have been extensively exposed during pre-training, RL primarily teaches timing and selection. On genuinely novel reasoning tasks, where base models lack established solution patterns, sufficiently prolonged RL can expand the capability frontier.

This has practical implications for how long to run RL training and on what tasks. If the goal is genuinely new reasoning capabilities rather than just better deployment of existing ones, RL must be applied to diverse, non-overtrained domains with sufficient training duration.

Original note title

prolonged rl discovers genuinely novel reasoning strategies inaccessible to base models even under extensive sampling