Does RL post-training create reasoning or just deploy it?

Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. The answer matters because it determines whether full RL pipelines or cheaper elicitation methods are the better way to build and optimize reasoning systems.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Post angle — Medium/LinkedIn

The dominant story: DeepSeek R1, OpenAI o1, and their successors acquire reasoning capability through RL post-training. On this account, RL teaches models to think step by step, to backtrack, and to verify, all capabilities they did not have before.

The emerging counter-evidence is striking. A hybrid model that pairs a base model's weights with a thinking model's decisions about when to deploy reasoning, with zero weight updates, recovers 91% of the performance gap to thinking models while steering only 12% of tokens. Base models, when sampled enough times, already produce reasoning traces that match thinking-model traces. Critique fine-tuning (CFT) on a single problem achieves reasoning gains on par with reinforcement learning with verifiable rewards (RLVR). Activation-space vectors encoding "backtracking" and "uncertainty estimation" are already present in base-model hidden states before any RL.
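A minimal sketch of how the two headline figures can be read. It assumes "gap recovered" means the hybrid's gain over the base model divided by the thinking model's gain over the base model, and "tokens steered" means the share of generated positions where an intervention was applied. Both definitions are assumptions, the function names are hypothetical, and the accuracies and mask are made-up values chosen only to land on 91% and 12%; they are not results from the underlying work.

```python
from typing import Sequence


def gap_recovered(base_acc: float, hybrid_acc: float, thinking_acc: float) -> float:
    """Fraction of the base-to-thinking accuracy gap closed by the hybrid model."""
    gap = thinking_acc - base_acc
    if gap <= 0:
        raise ValueError("thinking model must outperform the base model")
    return (hybrid_acc - base_acc) / gap


def steered_fraction(steer_mask: Sequence[bool]) -> float:
    """Share of generated tokens at which steering was applied."""
    if not steer_mask:
        raise ValueError("steer_mask must not be empty")
    return sum(steer_mask) / len(steer_mask)


# Illustrative numbers only (assumed, not taken from any paper).
base_acc, hybrid_acc, thinking_acc = 0.40, 0.855, 0.90
mask = [True] * 3 + [False] * 22  # 3 steered positions out of 25 tokens

print(f"gap recovered: {gap_recovered(base_acc, hybrid_acc, thinking_acc):.0%}")
print(f"tokens steered: {steered_fraction(mask):.0%}")
```

Under this reading, a high recovery ratio combined with a low steered fraction is what supports the claim that the missing ingredient is deployment timing rather than capability.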

The reframe: pre-training is when reasoning capability is acquired; RL post-training teaches when to deploy it.

This is not a trivial distinction. "When" training is cheaper, less data-hungry, and less fragile than "how" training. If capability already exists, elicitation methods (structured tool-calling, steering vectors, targeted fine-tuning on single problems) become much more attractive than full RL pipelines.
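To make the steering-vector option concrete, here is a toy sketch in plain Python. It assumes a difference-of-means construction (average hidden state on examples that show a behavior minus the average on examples that do not) and a per-token gate that decides where the vector is added. The note does not say how the cited vectors were derived, so the construction, the function names, and the two-dimensional "hidden states" are all illustrative assumptions rather than the method of any particular paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def difference_of_means(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Candidate steering direction: mean(positive) minus mean(negative)."""
    pos_mean, neg_mean = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden_states: Sequence[Vector], direction: Vector,
          gate: Sequence[bool], alpha: float = 1.0) -> List[Vector]:
    """Add alpha * direction only at positions where the gate is on."""
    return [
        [h + alpha * d for h, d in zip(state, direction)] if on else list(state)
        for state, on in zip(hidden_states, gate)
    ]


# Toy activations: "positive" examples exhibit the behavior (say, backtracking).
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 1.0], [0.0, 1.0]]
direction = difference_of_means(positive, negative)

hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
gate = [False, True, False]  # the "when" decision: steer one position only
steered = steer(hidden, direction, gate, alpha=0.5)
print(direction)
print(steered)
```

The weights never change in this sketch; only the gate and the scale `alpha` are chosen, which is why this kind of elicitation is cheap compared with a full RL pipeline.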

The hook for readers: "We've been crediting the locksmith for the key."

Connections: Does RL teach reasoning or just when to use it?, Do base models already contain hidden reasoning ability?, Can modular cognitive tools unlock reasoning without training?

Related concepts in this collection 6

Does extended thinking help or hurt model reasoning? Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.
extends the "when not how" claim: RL also manages the *quality direction* of thinking, redirecting extended reasoning from unproductive self-doubt toward productive gap analysis in conversational contexts
Can dialogue planning balance fast responses with strategic depth? Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
dialogue-specific instantiation of "when not how": the policy model has dialogue capabilities from pretraining; the uncertainty-switching mechanism teaches when to deploy deep planning
Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
the strongest concrete implementation: Thinkless's control token design makes "when not how" architecturally explicit; RL optimizes a single routing token, not reasoning content
Does reinforcement learning update only a small fraction of parameters? Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
parametric evidence: if RL modifies only 5–30% of parameters, the untouched 70–95% plausibly already carry most of the capability; sparse but full-rank updates are consistent with "when not how" rather than proof of it
Can reinforcement learning discover reasoning strategies base models cannot? Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
TENSION: ProRL challenges the "when not how" framing on novel non-overtrained tasks; the resolution may be domain-conditional — timing-only on overtrained domains, genuine capability creation on novel tasks with sufficient RL duration
Does RL training follow a predictable two-phase learning sequence? This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
deepens: the two-phase dynamic decomposes "when" into a temporal structure — execution tokens are "how" (learned first), planning tokens are "when" (learned second); the "when not how" thesis applies specifically to the planning-token phase
