Why does RLVR training narrow a model's problem-solving ability?
RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
RLVR faces a fundamental challenge: the solution space of LLMs is so vast and sparse that current techniques cannot guide effective exploration of unknown pathways. Long reasoning tasks are especially vulnerable — a single erroneous step nullifies the reward for the entire trajectory, failing to provide any positive signal for acquiring new knowledge. The result is "capability boundary collapse": the model's exploratory range contracts, and its problem-solving scope narrows rather than expands.
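To make the sparsity argument concrete, here is a minimal sketch. It assumes every reasoning step succeeds independently with the same probability, which is an illustrative simplification and not a claim from the source; the function names are hypothetical.

```python
def trajectory_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # One wrong step zeroes the outcome reward, so every step must succeed.
    return step_accuracy ** num_steps


def expected_rollouts_until_reward(step_accuracy: float, num_steps: int) -> float:
    """Expected number of sampled trajectories before the first rewarded one."""
    p = trajectory_success_probability(step_accuracy, num_steps)
    return float("inf") if p == 0.0 else 1.0 / p


if __name__ == "__main__":
    for steps in (5, 20, 50):
        p = trajectory_success_probability(0.95, steps)
        n = expected_rollouts_until_reward(0.95, steps)
        print(f"{steps:>2} steps: P(reward) = {p:.3f}, expected rollouts = {n:.1f}")
```

Under this assumption the chance of any positive signal decays geometrically with trajectory length, which is why long reasoning tasks are the first to lose their learning signal.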
The mechanism parallels an educational insight: a model that only "thinks" (exploits internal knowledge) without "learning" (exploring external knowledge) will be "in peril." RLVR excels at inward exploitation, refining and optimizing reasoning methods it already knows, but is weak at outward exploration: discovering reasoning paths to which the current policy assigns low probability.
RL-PLUS addresses this with two components. Multiple Importance Sampling combines information from multiple policies to provide low-variance, unbiased reward estimation from external data — avoiding both the systematic bias of on-policy approaches and the high variance of naive off-policy corrections. An Exploration-Based Advantage Function reshapes the learning objective by up-weighting advantages for reasoning paths that are correct but have low probability under the current policy — explicitly incentivizing discovery of valuable information the model would typically overlook.
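The two components can be sketched as follows. This is an illustration under stated assumptions, not the RL-PLUS implementation: the importance weight uses the standard balance heuristic (target probability divided by an equal-weight mixture of behavior policies), and the exploration term is assumed to be a focal-style factor `(1 - p) ** gamma` applied to group-normalized advantages. All function names and the `gamma` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def mis_weight(p_target: float, p_behaviors: Sequence[float]) -> float:
    """Balance-heuristic weight: target prob over the mean behavior-policy prob."""
    if not p_behaviors:
        raise ValueError("need at least one behavior policy probability")
    mixture = sum(p_behaviors) / len(p_behaviors)
    if mixture <= 0.0:
        raise ValueError("sample has zero probability under every behavior policy")
    # If the target policy is itself in the mixture, the weight is bounded by
    # len(p_behaviors), which is what keeps the variance under control.
    return p_target / mixture


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: (reward - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def exploration_advantages(
    rewards: Sequence[float], probs: Sequence[float], gamma: float = 2.0
) -> List[float]:
    """Scale each advantage by (1 - p) ** gamma (assumed form, for illustration).

    Samples the current policy already finds likely are damped, so correct but
    improbable samples keep relatively more of their advantage.
    """
    if len(rewards) != len(probs):
        raise ValueError("rewards and probs must have the same length")
    base = group_advantages(rewards)
    return [a * (1.0 - p) ** gamma for a, p in zip(base, probs)]


if __name__ == "__main__":
    # An external sample the current policy rates at 0.2 and the external
    # policy is assumed to rate at 0.0 for lack of a known density.
    print(mis_weight(0.2, [0.2, 0.0]))
    # Two correct samples (reward 1): one unlikely (0.05), one likely (0.9).
    print(exploration_advantages([1, 1, 0, 0], [0.05, 0.9, 0.5, 0.5]))
```

In the second call, both correct samples start with the same normalized advantage, but the low-probability one retains most of it while the high-probability one is nearly zeroed out.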
Building on "Does policy entropy collapse limit reasoning performance in RL?", capability boundary collapse is the downstream consequence of entropy collapse at the task-capability level. Entropy collapse constrains the token-level distribution; capability boundary collapse constrains the problem-level distribution. Both reflect the same fundamental dynamic: optimization pressure narrows the space faster than exploration can maintain it.
Building on "Why do specialized models fail outside their domain?", capability boundary collapse is the RL-specific mechanism behind domain capability cliffs. The model does not just specialize; it actively loses the ability to generalize.
The Invisible Leash: formal constraint
"The Invisible Leash" provides the theoretical grounding: RLVR is constrained by the base model's support (it cannot sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that restricts discovery of entirely original solutions. The entropy-reward tradeoff is formalized: while RLVR reliably enhances pass@1 precision, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets. A subtle finding: RLVR sometimes increases token-level entropy (greater uncertainty at each generation step) while decreasing answer-level entropy (convergence onto fewer distinct answers). These seemingly more uncertain paths ultimately converge onto a smaller set of solutions. Breaking this invisible leash requires explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
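The pass@1 versus large-budget tradeoff and the answer-level entropy measure can be illustrated with a short sketch. It uses the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), and Shannon entropy over distinct final answers. The sample counts below are invented for illustration and are not results from the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def answer_entropy(answers: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over final answers."""
    if not answers:
        raise ValueError("answers must be non-empty")
    total = len(answers)
    return -sum((m / total) * math.log(m / total) for m in Counter(answers).values())


if __name__ == "__main__":
    n = 100
    # Hypothetical correct-sample counts per problem (assumed, not measured).
    base = {"easy": 30, "hard": 2}   # rarely right, but nonzero on the hard problem
    tuned = {"easy": 90, "hard": 0}  # sharper on easy, support lost on hard
    for k in (1, 64):
        base_mean = sum(pass_at_k(n, c, k) for c in base.values()) / len(base)
        tuned_mean = sum(pass_at_k(n, c, k) for c in tuned.values()) / len(tuned)
        print(f"pass@{k}: base = {base_mean:.3f}, tuned = {tuned_mean:.3f}")
    # Fewer distinct final answers means lower answer-level entropy.
    print(answer_entropy(["12", "12", "15", "7"]), answer_entropy(["12"] * 4))
```

With these invented counts the tuned model leads at k = 1 and trails at k = 64, because the hard problem has dropped out of its empirical support entirely.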
Inquiring lines that use this note as a source (18)
This note is a source for these synthesized inquiries.
- Can clean benchmarks reveal true RLVR reasoning gains?
- Can RLVR expand a model's reasoning capabilities beyond its training ceiling?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- What limits RL's ability to scale for reasoning at training time?
- Does sparsity in RL arise from training on policy-distribution data?
- Does RLVR reward structure create pressure toward traces that look right?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- What role do high-entropy minority tokens play in RLVR?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- Does RLVR expand model capability or reorganize existing capability?
- How does Supervised RL bridge the gap between SFT and RLVR?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- Why do six different RLVR algorithms converge on similar performance levels?
- How does prolonged RL training differ from standard RLVR approaches?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
Related concepts in this collection (7)
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  capability boundary collapse is the task-level manifestation of token-level entropy collapse
- Why do specialized models fail outside their domain?
  Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.
  capability boundary collapse is the RL-specific mechanism
- Does outcome-based RL diversity loss spread across unsolved problems?
  When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
  same dynamic: solved-problem optimization narrows unsolved-problem capability
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  boundary collapse explains why RL teaches timing: it can only refine what's already there
- Does RL training collapse format diversity in pretrained models?
  Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
  the format-level selection mechanism that precedes capability boundary collapse: RL first selects a dominant pretraining distribution format, then narrows further within that format; format selection is the macro-level collapse, capability boundary contraction is the micro-level consequence
- Is the exploration-exploitation trade-off actually fundamental?
  Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding measurement granularity's role in perceived trade-offs matters for scaling reasoning systems.
  reframes boundary collapse: if exploration and exploitation are orthogonal at the hidden-state level, capability boundary collapse may reflect a token-level measurement artifact rather than a fundamental constraint; VERL's dual-channel approach offers an alternative to RL-PLUS's external data integration
- Do overly hard RLVR samples actually harm model capabilities?
  Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
  grounds: over-hard samples are one driver of the capability-boundary collapse at scale
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Look Before You Leap: Autonomous Exploration for LLM Agents
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Original note title
capability boundary collapse in rlvr narrows the models problem-solving scope — external data integration via importance sampling counteracts it