Does RLVR actually expand what models can reason about?
Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
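The strongest empirical challenge to the "RL teaches reasoning" narrative comes from pass@k analysis, where pass@k is the probability that at least one of k sampled attempts solves a problem. At small k (e.g., k=1), RLVR models outperform their base models: they produce correct answers more reliably on any given attempt. But as k increases, the ordering reverses, and base models consistently surpass RLVR models across the benchmarks and model families tested. Given enough samples, the base model solves a wider set of problems than the RLVR model does, which implies that the reasoning paths RLVR models generate are already present in the base model's sampling distribution.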
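In practice pass@k is usually estimated from n samples per problem rather than by drawing exactly k. The sketch below shows the standard unbiased estimator from the code-generation evaluation literature. The function name and argument names are illustrative and do not come from the note's sources.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: total samples drawn, c: number of correct samples, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample; comb returns 0 when
    # k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the mean of `pass_at_k` over problems, computed at each k of interest for both the base and the RLVR model.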
This reframes what RLVR actually does. Rather than expanding the frontier of solvable problems, RLVR concentrates the sampling distribution on correct solutions that were already accessible. The model learns to find correct paths more efficiently, not to reason in fundamentally new ways. Manual inspection supports this: for most problems that RLVR models solve, the base model can also produce at least one correct chain-of-thought.
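A toy calculation can make the crossover concrete. In the sketch below the per-problem success probabilities are invented for illustration only. They model a base distribution with thin coverage of every problem and an RLVR-like distribution that is sharp on two problems and has lost the other two. With independent samples, pass@k for a problem with success probability p is 1 - (1 - p)^k.

```python
def mean_pass_at_k(success_probs, k):
    """Average pass@k over problems, assuming independent samples."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (not measured data).
base = [0.30, 0.05, 0.02, 0.01]  # broad but unreliable coverage
rlvr = [0.90, 0.60, 0.00, 0.00]  # reliable on some problems, zero on others

for k in (1, 4, 16, 64, 256):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(rlvr, k), 3))
```

Under these assumed numbers the RLVR-like distribution wins at k=1 and the base distribution wins at k=256, which is the qualitative pattern the pass@k analysis describes.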
Six popular RLVR algorithms (including GRPO and PPO variants) perform similarly, and all remain far from optimal in leveraging the base model's potential: they converge on similar subsets of the base model's capability space. This suggests the bottleneck is not algorithmic but structural. On-policy RL with verifiable rewards optimizes sampling, not capability.
The contrast with distillation is sharp. Distillation from a stronger teacher can transfer genuinely new reasoning patterns, expanding the student's reasoning scope beyond what the base model could sample. The RLVR finding fits the thesis of the related note "Does RL teach reasoning or just when to use it?", that RL teaches timing rather than capability: activation is not creation. Distillation, by contrast, is creation: it writes new patterns into the model's distribution.
The practical implication: if you need capabilities the base model doesn't have, distillation from a stronger model is the path. If the base model can already solve the problem (given enough samples), RLVR makes it reliable. These are different tools for different gaps.
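The choice above can be written as a small triage rule. This is only a sketch: the function name and both thresholds are hypothetical, and sensible values would depend on the task and sampling budget.

```python
def choose_method(base_pass_at_1: float,
                  base_pass_at_large_k: float,
                  coverage_floor: float = 0.05,      # hypothetical threshold
                  reliability_target: float = 0.90   # hypothetical threshold
                  ) -> str:
    """Pick a training approach from base-model pass@k measurements."""
    if base_pass_at_large_k < coverage_floor:
        # The base model almost never solves the task even with many samples,
        # so there is little for RLVR to sharpen.
        return "distillation"
    if base_pass_at_1 < reliability_target:
        # The capability is present but unreliable on a single attempt.
        return "rlvr"
    return "neither"
```

For example, `choose_method(0.2, 0.8)` returns `"rlvr"`, while `choose_method(0.01, 0.02)` returns `"distillation"`.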
Inquiring lines that use this note as a source (100)
This note is a source for these synthesized inquiries.
- Does good simulation eventually count as genuine realization?
- Why does binary reward forcing degrade model calibration?
- Can checklist-based rewards fix judgment problems in RL training?
- What makes some tasks bounded enough for reliable RL?
- Do spurious rewards activate reasoning without teaching new skills?
- How much RLVR improvement comes from benchmark data memorization?
- Can clean benchmarks reveal true RLVR reasoning gains?
- Can RLVR expand a model's reasoning capabilities beyond its training ceiling?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?
- What information do next-state signals contain beyond what scalar rewards capture?
- Why does early experience provide better warm-starts for downstream reinforcement learning?
- Can reinforcement learning add missing domain knowledge to fine-tuned reasoning models?
- Can RL teach when to use reasoning versus when to respond directly?
- Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
- How does the expert demonstration ceiling compare to the generation-verification gap bound?
- What stability techniques prevent collapse in policy-critic adversarial training?
- How does reinforcement learning compare to differentiable joint training for RAG?
- How do probability-based rewards compare to self-consistency as training signals for reasoning?
- At what capability level does the generation-verification gap make intrinsic rewards insufficient?
- Does reinforcement learning learn optimal per-turn reasoning discipline?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?
- Can UCB-style bonuses over outcome space prevent policy entropy collapse?
- How do Q-value models improve action selection compared to value models?
- Can model confidence signals replace explicit external reward functions?
- Can RL with verifiable rewards improve dialogue quality better than preference optimization?
- How does RL refine reasoning paths without simply adding model capability?
- How does the pretrained prior set a capability ceiling for reward model exploration?
- Does RL refine existing knowledge or discover entirely new capabilities?
- What happens to model reasoning when policy entropy collapses during RL?
- Can judges trained on both verifiable and non-verifiable tasks transfer across domains?
- Why does RL improve sampling efficiency but not expand capability boundaries?
- Can extended RL training unlock genuinely new reasoning strategies models cannot discover otherwise?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- Does RLVR reward structure create pressure toward traces that look right?
- Can proper scoring rules fix RLVR's degradation on disagreement prediction?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- What role do high-entropy minority tokens play in RLVR?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- Can models learn both what and how to study through reinforcement learning?
- Does the generation-verification gap actually limit self-improvement in verifiable tasks?
- Does reinforcement learning preserve reasoning quality better than supervised fine-tuning?
- What makes abstention a learnable behavior instead of a default penalty?
- Can reinforcement learning teach AI when to ask clarifying questions?
- Does RLVR expand model capability or reorganize existing capability?
- Does RL teach models when to use reasoning or how to reason?
- Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?
- Can reward design fix the conflict between reasoning accuracy and abstention calibration?
- Why does prolonged RL discover strategies absent from any base model sample?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- How does Supervised RL bridge the gap between SFT and RLVR?
- When does outcome reward signal become informative during model training?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- What reward mechanisms make thinking-based compression budget-controllable and reliable?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- Can process supervision improve agentic RL through meta-reasoning rewards?
- How does belief-shift reward compare to curiosity-driven and process reward approaches?
- Why does belief-shift reward enable smaller models to match larger baselines?
- How do dense token-level rewards compare to sparse task-level verification signals?
- How do reward signals in RLVR interact with pretraining biases?
- Why do reward models fail to recognize genuinely different valid answers?
- How does 93% reward reliability compare to other RL noise sources?
- How much data do generative process reward models actually need?
- Does reinforcement learning teach models how to reason or when to reason?
- How do verifier-free and adversarial approaches compare in extending reasoning RL?
- What scaling properties emerge from RL training dynamics beyond verification?
- Can reinforcement learning close the gap between LLM reasoning and action?
- What makes exploration and reflection rewards verifiable in agentic environments?
- What training duration is actually needed for RL to expand capabilities?
- What does RL post-training actually teach reasoning systems?
- Does outcome-based reinforcement learning improve explanation faithfulness?
- What other downstream metrics could serve as RL reward sources?
- Can verifiable rewards during pretraining replace costly human preference labeling?
- Can verifier-free RL work without manual preference labels or task-specific training?
- Why does prompting discover capabilities that need reward-driven refinement?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- Are different reward signal sources substitutable in verifier-free RL?
- Why do six different RLVR algorithms converge on similar performance levels?
- How does prolonged RL training differ from standard RLVR approaches?
- Can early experience replace external rewards as a learning signal?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- Can structured rewards still teach models when spurious rewards also work?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- Can held-out validation gates prevent optimizer hallucinations in skill proposals?
- Why does reinforcement learning training degrade model calibration?
- What makes reward models fundamentally different from policy discriminators?
- Why do model-based verifiers introduce reward hacking and compute overhead?
- What makes reward signal sources substitutable across verifier-free RL patterns?
- What makes a task at the edge of competence optimal for RL?
- Can RL create new reasoning primitives that pretraining never established?
- How do extrapolative and contextual generalization measure RL reasoning gains?
- What makes user-decision rewards better than model-confidence rewards?
- When does reinforcement learning actually produce true reasoning gains in models?
- Can models trained with RL on pretraining data avoid reward hacking seen in RLHF?
- Does the generation-verification gap define where self-rewarding actually works?
- What makes exploration a verifiable and measurable training objective?
Related concepts in this collection (4)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  RLVR confirms the timing-not-capability thesis with pass@k evidence
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  RLVR finding shows the latent capability is an upper bound, not a floor
- Can reinforcement learning discover reasoning strategies base models cannot?
  Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
  tension: this claims RL does expand boundaries under prolonged training
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  emergence may operate at a different level than sampling efficiency
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Escaping the Verifier: Learning to Reason via Demonstrations
- Spurious Rewards: Rethinking Training Signals in RLVR
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Original note title
rlvr does not expand reasoning capability boundaries beyond the base model — it improves sampling efficiency within existing boundaries