Does reinforcement learning create new reasoning abilities or activate existing ones?

RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.

Synthesis note · 2026-02-23 · sourced from Reasoning Architectures

Two prominent claims about what RL post-training does appear contradictory:

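The timing thesis: RL functions as a deployment optimizer. It teaches a model when to use reasoning rather than how to reason (see "Does RL teach reasoning or just when to use it?"), because base models already contain the underlying ability (see "Do base models already contain hidden reasoning ability?"). Evidence: base models outperform RLVR-trained models at high pass@k, RL-trained models show the same solution strategies as base models, and a single training example can be enough to unlock mathematical reasoning (see "Can a single training example unlock mathematical reasoning?").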

The capability thesis: RL can discover reasoning strategies that base models cannot (see "Can reinforcement learning discover reasoning strategies base models cannot?"). Evidence: ProRL produces strategies absent from any base-model sample regardless of sampling budget, while self-evolving curriculum RL breaks the boundary constraints identified by pass@k analysis (see "Does RLVR actually expand what models can reason about?").

The domain-conditional resolution: Both are correct under different conditions. For standard math/code reasoning where the problem structure is well-represented in pretraining data, RL activates latent capability (timing thesis). For complex tasks requiring multi-step planning, tool coordination, or novel strategy recombination, RL may create genuinely new capability through prolonged training (capability thesis).

Supporting evidence for the conditional view:

RLVR pass@k boundary collapse occurs on standard benchmarks (MATH, GSM8K)
ProRL novel strategy discovery occurs on problems requiring deep planning
SWE-RL doubles baseline performance on long-horizon engineering tasks, a gain that goes beyond activation
Duration matters: short RLVR narrows boundaries while prolonged RL pushes through them
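
The pass@k comparisons cited above can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that are correct. The per-problem counts below are hypothetical and chosen only to illustrate the pattern the timing thesis describes; they are not results from any of the cited work. The helper names (`pass_at_k`, `mean_pass_at_k`) are likewise illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: the probability that at least one of
    k samples, drawn without replacement from n generated samples of which
    c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL data: correct samples out of n = 100 for six problems.
# The "base" model solves most problems, but rarely.
# The "RL" model solves fewer problems, but reliably.
N_SAMPLES = 100
base_counts = [2, 1, 3, 1, 0, 2]
rl_counts = [90, 80, 95, 0, 0, 85]

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the RL model is far ahead at k = 1 while the base model is ahead at k = 100, because at large k the metric approaches the fraction of problems with at least one correct sample. That is the sense in which high-k pass@k measures the reasoning boundary rather than sampling efficiency.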

The practical implication: RL training investment should be calibrated to the target domain. For standard reasoning, minimal RL (even one example) suffices. For complex agentic tasks, sustained RL investment with evolving curricula is justified.
