When does RL actually extend reasoning beyond pretraining?

Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.

Synthesis note · 2026-06-03 · sourced from Reasoning Architectures

Whether RL post-training extends reasoning beyond what pretraining provided, or merely refines sampling, is one of the field's live disputes. The disagreement persists because modern pipelines are uncontrolled: pretraining corpora are opaque, mid-training is under-examined, and RL interacts with unknown priors. This paper builds a fully controlled framework (synthetic reasoning tasks with explicit atomic operations, parseable step traces, and systematic manipulation of training distributions) to isolate each stage's causal contribution. It evaluates two kinds of generalization: extrapolative (harder compositions) and contextual (new surface contexts).
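
To make the idea of a controlled task family concrete, here is a minimal sketch of what such a generator could look like. Everything in it is an assumption for illustration: the operation names, the trace format, and the depth split are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical atomic operations; the source does not list the actual ones.
OPS = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "neg": lambda x: -x,
}

def make_task(depth, rng):
    """Sample a composition of `depth` atomic operations plus its step trace."""
    start = rng.randint(0, 9)
    names = [rng.choice(sorted(OPS)) for _ in range(depth)]
    value, trace = start, []
    for name in names:
        value = OPS[name](value)
        trace.append(f"{name} -> {value}")  # one parseable line per step
    return {"start": start, "ops": names, "trace": trace, "answer": value}

def parse_trace(trace):
    """Recover (operation, intermediate value) pairs from a trace."""
    pairs = []
    for step in trace:
        name, value = step.split(" -> ")
        pairs.append((name, int(value)))
    return pairs

rng = random.Random(0)
# Assumed split: train on shallow compositions, test extrapolation on deeper ones.
train = [make_task(rng.randint(1, 3), rng) for _ in range(5)]
extrapolation = [make_task(6, rng) for _ in range(5)]
```

Because every step is explicit and parseable, a grader can check intermediate values as well as the final answer, and the training distribution can be shifted simply by changing which depths or operations are sampled.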

The reconciliation is precise: RL produces true capability gains (measured at pass@128, not just pass@1) only when two conditions hold. First, pretraining leaves sufficient headroom. Second, RL data targets the model's edge of competence: tasks that are difficult but not out of reach. When pretraining has already established the reasoning primitives, RL's job is to extend their composition; when it has not, RL cannot conjure them.
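
The distinction between pass@1 and pass@128 can be illustrated with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The sketch below also shows one possible way to select edge-of-competence tasks by empirical solve rate. The `low` and `high` thresholds and the function name `edge_of_competence` are assumptions for illustration; the source only says "difficult but not out of reach".

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)

def edge_of_competence(solve_rates, low=0.05, high=0.5):
    """Keep task ids whose solve rate is non-trivial but far from saturated.

    The band [low, high] is a hypothetical choice, not a value from the source.
    """
    return [tid for tid, rate in solve_rates.items() if low <= rate <= high]

# A task solved once in 256 samples looks hopeless at k=1 but not at k=128.
rare_at_1 = pass_at_k(256, 1, 1)
rare_at_128 = pass_at_k(256, 1, 128)

selected = edge_of_competence({"easy": 0.95, "edge": 0.2, "unreachable": 0.0})
```

A task the model solves only rarely contributes almost nothing to pass@1 yet counts substantially at pass@128, which is why a gain at large k is read as a change in what the model can reach rather than in how often it samples a solution it already had.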

This is the controlled-experiment capstone for the vault's RLVR-capability cluster. It sharpens "Does RL teach reasoning or just when to use it?" and "Why does RLVR work with completely random rewards?" by adding the headroom and edge-of-competence conditions under which RL genuinely extends (not just samples) capability. It also gives actionable guidance for data curricula and compute allocation across stages.

Related concepts in this collection 3

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
Link note: this adds the controlled-experiment conditions under which RL extends vs merely activates.

Why does RLVR work with completely random rewards?
RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
Link note: both bound what RL actually contributes; this specifies the pretraining-headroom precondition.

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
Link note: the headroom condition is the flip side: RL can only extend primitives pretraining already laid down.
