Does RL post-training create reasoning or just deploy it?
Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
Post angle — Medium/LinkedIn
The dominant story: DeepSeek-R1, OpenAI o1, and their successors acquire reasoning capability through RL post-training. On this account, RL teaches models to think step by step, to backtrack, and to verify, capabilities they did not have before.
The emerging counter-evidence is striking. A hybrid model that keeps a base model's weights and borrows only a thinking model's deployment decisions, with zero weight updates, recovers 91% of the performance gap to thinking models while steering only 12% of tokens. Sampled often enough, base models already produce reasoning traces that match those of thinking models. Critique fine-tuning (CFT) on a single problem achieves reasoning gains on par with reinforcement learning with verifiable rewards (RLVR). Activation-space vectors encoding "backtracking" and "uncertainty estimation" already exist in base model hidden states before any RL.
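The hybrid-model result can be pictured with a toy sketch: the weights stay frozen, and a steering vector is added to the hidden state only at the positions a gate selects. This is an illustration of the general idea, not the method of any cited paper. The names `steer_hidden_states`, `steered_fraction`, `gate`, and `alpha` are hypothetical, and the hidden states are plain Python lists rather than real model activations.

```python
from typing import List

Vector = List[float]


def steer_hidden_states(
    hidden: List[Vector],
    steering_vector: Vector,
    gate: List[bool],
    alpha: float = 1.0,
) -> List[Vector]:
    """Add alpha * steering_vector to the hidden state at gated positions only.

    hidden: one vector per token position (toy stand-in for model activations).
    gate:   True where the (hypothetical) deployment decision says "steer here".
    No weights are modified; only the selected activations are shifted.
    """
    if len(hidden) != len(gate):
        raise ValueError("hidden and gate must have the same length")
    steered: List[Vector] = []
    for state, steer_here in zip(hidden, gate):
        if len(state) != len(steering_vector):
            raise ValueError("hidden state and steering vector differ in size")
        if steer_here:
            steered.append([x + alpha * v for x, v in zip(state, steering_vector)])
        else:
            steered.append(list(state))  # untouched copy
    return steered


def steered_fraction(gate: List[bool]) -> float:
    """Share of token positions that were steered (0.0 for an empty gate)."""
    return sum(gate) / len(gate) if gate else 0.0
```

In a real model the gate would come from some learned or borrowed decision rule, and the shift would be applied inside a forward pass; both are assumptions left outside this sketch.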
The reframe: pre-training is when reasoning capability is acquired; RL post-training teaches when to deploy it.
This is not a trivial distinction. "When" training is cheaper, less data-hungry, and less fragile than "how" training. If capability already exists, elicitation methods (structured tool-calling, steering vectors, targeted fine-tuning on single problems) become much more attractive than full RL pipelines.
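Whether a capability "already exists" in a base model is commonly probed with pass@k: sample n completions per problem, count the c correct ones, and estimate the chance that at least one of k samples succeeds. A minimal sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), follows. The function name `pass_at_k` is a choice made here, not something taken from the note.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct.

    Returns the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to c / n. Comparing a base model and its RL-tuned counterpart at small and large k is one way to separate better sampling efficiency from a genuinely wider capability boundary.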
The hook for readers: "We've been crediting the locksmith for the key."
Connections: Does RL teach reasoning or just when to use it?, Do base models already contain hidden reasoning ability?, Can modular cognitive tools unlock reasoning without training?
Inquiring lines that use this note as a source (151)
This note is a source for these synthesized inquiries.
- Does AI knowledge precede actual expertise in hyperreal production?
- How does baseline capability level affect RL improvement ceiling?
- Can RLVR expand a model's reasoning capabilities beyond its training ceiling?
- Can latent reasoning architectures work as retrofits to existing models?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- Why must procedural skills consolidate before strategic reasoning can develop?
- How does non-reasoning SFT prevent overfitting before RL training begins?
- Can a single SAE feature control reasoning behavior across model families?
- What makes reasoning capability a pre-training rather than post-training phenomenon?
- What makes bilevel metacognition architectural rather than emergent in current systems?
- How much do mechanistic interpretability findings reflect true reasoning architecture?
- Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?
- How much does training data format shape what reasoning strategy emerges?
- Why does training format shape reasoning strategy more than domain?
- Why does training data format shape reasoning strategy more than domain content?
- How does business logic specification replace annotated training datasets?
- How should guidance levels adapt as the model's capability boundary shifts?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Does policy entropy collapse represent the main bottleneck in reasoning-focused RL scaling?
- What does an intermediate interface between planning and grounding actually look like?
- Can reinforcement learning add missing domain knowledge to fine-tuned reasoning models?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- How much reasoning catalyst data is actually needed for improvement?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- Do models trained for safety over-refuse compared to models trained for reasoning?
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- Can RL teach when to use reasoning versus when to respond directly?
- What makes training data quality more important than quantity for reasoning?
- Can frozen world models from training cutoff remain adequate for real-world reasoning?
- How does training format shape reasoning strategy more than content?
- Does reasoning trace style explain why RL post-training improves model reasoning?
- Can prompt optimization or fine-tuning inject knowledge models do not already contain?
- How does inductive reasoning from partial evidence enable hypothesis formation?
- Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?
- Why does imitation learning create a ceiling for reasoning capability?
- Can targeted activation steering surface latent reasoning in base models?
- What makes reasoning-specific post-training different from standard parameter scaling?
- How does RL refine reasoning paths without simply adding model capability?
- Why does the gap between theoretical expressiveness and learned capability matter?
- Does fine-tuning actually change model capabilities or only output distribution?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- Can RL-trained meta-agents match or exceed manually designed workflows?
- What happens when reasoning fine-tuning eliminates model refusal mechanisms entirely?
- What separates knowledge from reasoning in neural network layers?
- What role does inductive bias play versus model capacity in practice?
- What role does curriculum design play in reasoning emergence?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- Can smaller models achieve domain expertise through focused RL training?
- Does RL refine existing knowledge or discover entirely new capabilities?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- How does RL compress reasoning path diversity during training?
- What makes software engineering environments better suited for RL than other interactive domains?
- What limits RL's ability to scale for reasoning at training time?
- Do thought anchors correspond mechanistically to planning tokens in RL?
- Why does reasoning graph topology evolve differently across training phases?
- Which recipe choices determine the asymptotic ceiling in RL training?
- Can RL training teach models when to activate reasoning versus when to skip it?
- What happens to model reasoning when policy entropy collapses during RL?
- Why does RL improve sampling efficiency but not expand capability boundaries?
- How does a single training example trigger phase transitions in reasoning output?
- Can extended RL training unlock genuinely new reasoning strategies models cannot discover otherwise?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- Do reasoning architectures and role-playing objectives fundamentally conflict?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- Does reasoning fine-tuning actually damage a model's ability to abstain?
- What changes when reasoning models adopt trajectory-response output formats?
- Does reinforcement learning preserve reasoning quality better than supervised fine-tuning?
- Does this reasoning steering method work consistently across all model sizes?
- How does correctness emergence occur when no expert initially solved the task?
- What role does self-learning play in improving agent reasoning without annotation?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- Does RL training actually restore the critical thinking that reasoning models lose?
- How much reasoning depth do we actually need for most real-world tasks?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- How does model weight freezing across users affect virtual instance individuation?
- How does training data format shape which reasoning patterns emerge in models?
- Does RLVR expand model capability or reorganize existing capability?
- Does RL teach models when to use reasoning or how to reason?
- How do RL training and base models differ in creating MI peaks?
- Why does training data format shape reasoning strategy more than content?
- How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
- Can knowledge encoded in model representations fail to influence generation?
- How should humans specify deterministic abstractions of RL problems?
- Why does prolonged RL discover strategies absent from any base model sample?
- Can one training example activate mathematical reasoning in RL-trained models?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- What structural differences emerge between early generic skills and later meta-strategy skills?
- Can training format itself shape what reasoning strategy a model learns?
- How does routing decide between models before generation happens?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- How does making implicit reasoning requirements explicit change model performance?
- Can process supervision improve agentic RL through meta-reasoning rewards?
- Why does a replay mechanism prevent reasoner skills from over-specializing?
- Can pretraining signals unlock latent reasoning that post-training merely activates?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- What happens to base model capabilities when you apply finetuning?
- Do base models truly possess latent reasoning capability?
- Does latent reasoning capability exist in base models before any training?
- Does training data format shape reasoning strategy more than domain content?
- How does backward reasoning during training improve forward reasoning capability?
- Can models develop situational awareness without explicit training for it?
- Can you steer reasoning by directly manipulating SAE features?
- Can attractor dynamics compete with input-based probing for characterizing model knowledge?
- Does RL training activate latent meta-learning capacity or create it from scratch?
- How do verifier-free and adversarial approaches compare in extending reasoning RL?
- What scaling properties emerge from RL training dynamics beyond verification?
- How much does pretraining quality affect the modularity of fine-tuned models?
- What distinguishes genuine capability gains from coherent but invalid reasoning traces?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- What training duration is actually needed for RL to expand capabilities?
- Does RL primarily teach when to use reasoning or how to reason?
- Can the exploration ceiling be raised beyond what pretraining established?
- What does RL post-training actually teach reasoning systems?
- How should abstraction preserve applicability conditions when distilling experience?
- What happens to representational structure during model pretraining phases?
- How does RPT compare to learning when versus how to deploy reasoning?
- Why does policy entropy collapse when scaling RL for reasoning?
- Does the base model already contain latent reasoning capability?
- What makes supervised fine-tuning worsen RL exploration later?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- Why do knowledge and reasoning train in different network layers?
- Can specialized components replace single fully-trained models in deployment?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- Why does pre-training provide the raw material for emergent thinking?
- How much does training data format influence reasoning strategy versus domain content?
- What mechanisms activate latent reasoning capabilities already present in base models?
- How does training data structure shape reasoning strategy more than domain content?
- Why does extended reasoning training improve exploration without adding new capabilities?
- Can base models spontaneously produce reasoning traces without any RL training?
- When does RL discover genuinely novel reasoning strategies versus timing optimization?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- How does pretraining determine what RL can later teach a model?
- What makes a task at the edge of competence optimal for RL?
- Can RL create new reasoning primitives that pretraining never established?
- How do extrapolative and contextual generalization measure RL reasoning gains?
- Can structured workflows unlock latent reasoning abilities that raw models don't show?
- Do base models already contain latent behavioral principles waiting to be amplified?
- Why does reasoning fine-tuning reduce models' ability to abstain?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- How does early commitment in reasoning differ from early exploitation in planning?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
- What latent reasoning capability do base models already possess before training?
- Does finetuning facts into weights overwrite existing model capabilities?
Related concepts in this collection (6)
- Does extended thinking help or hurt model reasoning?
  Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.
  extends the "when not how" claim: RL also manages the *quality direction* of thinking, redirecting extended reasoning from unproductive self-doubt toward productive gap analysis in conversational contexts
- Can dialogue planning balance fast responses with strategic depth?
  Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
  dialogue-specific instantiation of "when not how": the policy model has dialogue capabilities from pretraining; the uncertainty-switching mechanism teaches when to deploy deep planning
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  the strongest concrete implementation: Thinkless's control token design makes "when not how" architecturally explicit; RL optimizes a single routing token, not reasoning content
- Does reinforcement learning update only a small fraction of parameters?
  Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
  parametric evidence: if RL only touches 5–30% of parameters, the remaining 70–95% already encode the capability; sparse-but-full-rank updates are the physical signature of "when not how"
- Can reinforcement learning discover reasoning strategies base models cannot?
  Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
  TENSION: ProRL challenges the "when not how" framing on novel non-overtrained tasks; the resolution may be domain-conditional: timing-only on overtrained domains, genuine capability creation on novel tasks with sufficient RL duration
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  deepens: the two-phase dynamic decomposes "when" into a temporal structure: execution tokens are "how" (learned first), planning tokens are "when" (learned second); the "when not how" thesis applies specifically to the planning-token phase
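The sparse-update claim among the related concepts (RL modifying roughly 5–30% of parameters) is a measurable quantity: compare a base checkpoint with its RL-tuned counterpart and count the entries that moved. A minimal sketch under stated assumptions: checkpoints are represented here as plain dictionaries of flat float lists, and `update_fraction` and `tol` are hypothetical names, not part of any library.

```python
from typing import Dict, List


def update_fraction(
    base: Dict[str, List[float]],
    tuned: Dict[str, List[float]],
    tol: float = 0.0,
) -> float:
    """Fraction of parameter entries whose value changed by more than tol.

    base / tuned: parameter name -> flattened values (toy checkpoint format).
    Every name in base must exist in tuned with the same number of entries.
    """
    changed = 0
    total = 0
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        for before, after in zip(base_values, tuned_values):
            total += 1
            if abs(after - before) > tol:
                changed += 1
    if total == 0:
        raise ValueError("no parameters to compare")
    return changed / total
```

The choice of `tol` matters in practice: with low-precision weights, an exact comparison (`tol = 0.0`) and a small threshold can give noticeably different fractions, so any reported percentage should state which was used.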
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Eliciting Reasoning in Language Models with Cognitive Tools
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
- Base Models Know How to Reason, Thinking Models Learn When
Original note title
thinking models learn when not how — the case that rl post-training is a deployment optimizer not a capability creator