SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Agentic Systems and Tool Use

Does reinforcement learning create new reasoning abilities or activate existing ones?

RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.

Synthesis note · 2026-02-23 · sourced from Reasoning Architectures
Do reasoning traces show how models actually think? What actually changes inside a model during RL training?

Two prominent claims about what RL post-training does appear contradictory:

The timing thesis: RL functions as a deployment optimizer. It teaches a model when to use reasoning it already has rather than the reasoning itself (see "Does RL teach reasoning or just when to use it?" and "Do base models already contain hidden reasoning ability?"). Evidence: base models outperform RLVR-trained models at high pass@k, RL-trained models show the same solution strategies as base models, and a single training example can be enough to unlock mathematical reasoning (see "Can a single training example unlock mathematical reasoning?").

The capability thesis: RL can discover reasoning strategies that base models cannot (see "Can reinforcement learning discover reasoning strategies base models cannot?"). Evidence: ProRL shows strategies absent from any base-model sample regardless of sampling budget, while self-evolving curriculum RL breaks the boundary constraints identified by pass@k analysis (see "Does RLVR actually expand what models can reason about?").
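Both theses lean on pass@k, so it helps to see how the metric behaves. The sketch below uses the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The note does not say which estimator its sources use, so treat that choice as an assumption. The per-problem counts are hypothetical, chosen only to show how an RL-tuned model can lead at k = 1 and still trail the base model at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample at all.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical numbers, for illustration only (not taken from any paper).
N = 100
base_counts = [5, 5, 5, 5]    # base model: rarely right, but on every problem
rl_counts = [90, 90, 90, 0]   # RL model: reliable on three, never on the fourth

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, N, k)
    rl = mean_pass_at_k(rl_counts, N, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the RL model wins at k = 1, while the base model reaches every problem once k is large enough. That is the pattern the timing thesis cites. The capability thesis corresponds to the opposite case, in which the RL model solves a problem that the base model never solves at any k.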

The domain-conditional resolution: Both are correct under different conditions. For standard math/code reasoning where the problem structure is well-represented in pretraining data, RL activates latent capability (timing thesis). For complex tasks requiring multi-step planning, tool coordination, or novel strategy recombination, RL may create genuinely new capability through prolonged training (capability thesis).

Supporting evidence for the conditional view:

The practical implication: RL training investment should be calibrated to the target domain. For standard reasoning, minimal RL (in some cases a single training example) can suffice. For complex agentic tasks, sustained RL investment with evolving curricula is justified.

Original note title

RL capability creation is domain-conditional — standard reasoning activates latent capability while complex planning may generate genuinely novel strategies