SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals · Reasoning, Retrieval, and Evaluation

When does RL actually extend reasoning beyond pretraining?

Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.

Synthesis note · 2026-06-03 · sourced from Reasoning Architectures

Whether RL post-training extends reasoning beyond what pretraining provided, or merely refines sampling, is one of the field's live disputes. The disagreement persists because modern pipelines are uncontrolled: pretraining corpora are opaque, mid-training is under-examined, and RL interacts with unknown priors. The source paper builds a fully controlled framework (synthetic reasoning tasks with explicit atomic operations, parseable step traces, and systematic manipulation of training distributions) to isolate each stage's causal contribution. It evaluates two kinds of generalization: extrapolative (harder compositions) and contextual (new surface contexts).
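To make "explicit atomic operations" and "parseable step traces" concrete, here is a minimal sketch of what such a synthetic task could look like. Everything in it is an assumption for illustration: the operation names (`add3`, `double`, `neg`), the trace format, and the helpers `make_task` and `verify_trace` are hypothetical and are not taken from the paper.

```python
import random
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical atomic operations; the source does not list the real ones.
ATOMIC_OPS = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "neg": lambda x: -x,
}

# Assumed trace format: "<op>(<input>) = <output>", one step per line.
STEP = re.compile(r"(\w+)\((-?\d+)\) = (-?\d+)")


@dataclass
class Task:
    start: int
    ops: List[str]
    trace: List[str]
    answer: int


def make_task(depth: int, rng: random.Random) -> Task:
    """Compose `depth` atomic operations and record every intermediate step."""
    start = rng.randint(0, 9)
    ops = [rng.choice(sorted(ATOMIC_OPS)) for _ in range(depth)]
    value, trace = start, []
    for name in ops:
        new_value = ATOMIC_OPS[name](value)
        trace.append(f"{name}({value}) = {new_value}")
        value = new_value
    return Task(start, ops, trace, value)


def verify_trace(start: int, trace: List[str]) -> Optional[int]:
    """Return the final value if every step parses and is correct, else None."""
    value = start
    for step in trace:
        match = STEP.fullmatch(step)
        if match is None:
            return None
        name, arg, out = match.group(1), int(match.group(2)), int(match.group(3))
        if name not in ATOMIC_OPS or arg != value or ATOMIC_OPS[name](arg) != out:
            return None
        value = out
    return value
```

In a setup like this, `depth` is the knob for extrapolative generalization: a model could be trained on shallow compositions and evaluated on deeper ones, with `verify_trace` checking the reasoning steps rather than only the final answer.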

The reconciliation is precise: RL produces true capability gains (measured at pass@128, not just pass@1) only when two conditions hold. First, pretraining must leave sufficient headroom. Second, RL data must target the model's edge of competence: difficult, but not out of reach. When pretraining has already established the reasoning primitives, RL's job is to extend their composition; when it has not, RL cannot conjure them.
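The difference between pass@1 and pass@128 can be made concrete with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sketch below also shows one possible way to pick "edge of competence" tasks. The selection rule and the `max_pass1` threshold are assumptions for illustration; the source does not specify how the paper operationalizes the idea.

```python
from math import comb
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def select_edge_tasks(
    correct_counts: Dict[str, int], n: int, max_pass1: float = 0.2
) -> List[str]:
    """Hypothetical filter: keep tasks solved at least once in n samples
    (not out of reach) whose pass@1 is still at most max_pass1 (difficult)."""
    selected = []
    for task_id, c in correct_counts.items():
        solved_at_least_once = c >= 1
        still_hard = pass_at_k(n, c, 1) <= max_pass1
        if solved_at_least_once and still_hard:
            selected.append(task_id)
    return selected
```

For example, with 128 samples per task, a task solved 3 times would be kept by this filter, while a task solved 0 times (out of reach) or 100 times (already mastered) would be dropped.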

This is the controlled-experiment capstone for the vault's RLVR-capability cluster. It sharpens "Does RL teach reasoning or just when to use it?" and "Why does RLVR work with completely random rewards?" by adding the two conditions, headroom and edge-of-competence targeting, under which RL genuinely extends capability rather than just resampling it. It also gives actionable guidance for data curricula and compute allocation across stages.

Original note title

RL produces true reasoning gains only when pretraining leaves headroom and RL data targets the model's edge of competence