Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
This explores why reinforcement learning with verifiable rewards (RLVR) mostly sharpens what a model can already do rather than teaching it genuinely new reasoning — and whether that ceiling is fundamental or just a side effect of how RLVR is usually trained.
Put as a question: if RLVR makes models visibly better at reasoning, why does that improvement seem to stop at the edge of what the base model could already do? The corpus largely agrees on the mechanism, then splits on whether the ceiling is real. The core finding is that RLVR doesn't add new reasoning; it reweights sampling toward solutions already present in the base model's distribution. Pass@k analysis is the key evidence: at high k, base models match or beat their RLVR-tuned versions, meaning RLVR narrowed the search rather than widening the space of solvable problems (Does RLVR actually expand what models can reason about?). Put differently, RLVR improves sampling efficiency within existing boundaries, and for a well-pretrained model a single training example, or even a spurious reward, can trigger most of the gain (What does reward learning actually do to model reasoning?).
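The pass@k comparison above can be made concrete with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that were correct. The sketch below is illustrative only: the per-problem counts in `BASE_CORRECT` and `TUNED_CORRECT` are made-up numbers chosen to show the crossover pattern, not results from any of the cited notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: correct samples out of 100 on four problems.
N_SAMPLES = 100
BASE_CORRECT = [5, 3, 2, 1]     # base model: rarely right, but on every problem
TUNED_CORRECT = [60, 40, 0, 0]  # tuned model: often right, on fewer problems

for k in (1, 10, 100):
    base = mean_pass_at_k(BASE_CORRECT, N_SAMPLES, k)
    tuned = mean_pass_at_k(TUNED_CORRECT, N_SAMPLES, k)
    print(f"k={k:3d}  base={base:.3f}  tuned={tuned:.3f}")
```

With these invented counts the tuned model leads at k=1 and the base model leads at k=100, which is the shape of the curve the pass@k argument relies on: sharper sampling on fewer problems.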
The deeper reason is that the capability was never RLVR's to create. Multiple independent probes (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering) all elicit reasoning that already sits latent in base-model activations, so post-training selects rather than builds (Do base models already contain hidden reasoning ability?). A sharp reframing makes this concrete: RL teaches a model *when* to deploy reasoning, not *how* to reason. Hybrid models recover 91% of the gains just by routing tokens, and the activation vectors for reasoning strategies exist before any RL touches them (Does RL post-training create reasoning or just deploy it?).
On top of 'merely doesn't expand' there is an active failure mode: RLVR can *shrink* the boundary. Its on-policy nature rewards exploitation over exploration, collapsing the model toward a narrower set of high-reward paths and abandoning underexplored-but-valuable ones. This is named capability boundary collapse, and it is counteracted by injecting external data and explicitly rewarding discovery (Why does RLVR training narrow a model's problem solving ability?). Feeding RLVR problems that are too hard makes it worse: rare accidental successes get treated as high-advantage trajectories under group-relative normalization, so the model learns answer-repetition and computation-skipping shortcuts that contaminate skills it already had (Do overly hard RLVR samples actually harm model capabilities?). And even when RLVR looks like it's improving reasoning, what it often improves is *coherence*, meaning smoother transitions between adjacent steps, without guaranteeing the proof is globally valid (Does RLVR actually improve mathematical reasoning or just coherence?).
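The point about too-hard problems is easier to see with numbers. The sketch below assumes the mean-and-standard-deviation normalization commonly used in GRPO-style training (advantage = (reward - group mean) / group std); the notes only say "group-relative normalization", so the exact formula, the `eps` term, and the group size of 16 are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts with binary rewards.
rare = group_relative_advantages([1.0] + [0.0] * 15)         # 1 lucky success
balanced = group_relative_advantages([1.0] * 8 + [0.0] * 8)  # 8 successes

print("advantage of the lone success:", round(rare[0], 3))
print("advantage of a success at 50% solve rate:", round(balanced[0], 3))
```

Under this assumed formula, a success in a group with solve rate p gets advantage sqrt((1 - p) / p), so the rarer the success, the larger the push toward whatever that rollout happened to do, whether or not its reasoning was sound.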
Here's the twist worth knowing: the ceiling may be a property of *how* RLVR is run, not of RLVR itself. Prolonged RL (with KL control, periodic policy resetting, and crucially a *diverse* task mix including non-mathematical domains) produces models that beat the base model at *every* pass@k level, exactly the signature that should be impossible if RLVR only optimizes sampling (Can reinforcement learning discover reasoning strategies base models cannot?). The common thread with the failure modes is exploration: collapse happens when training over-exploits a narrow distribution; expansion happens when training is forced to explore domains where the base model has no established pattern to fall back on. This also reframes the benchmark debate: genuine behavioral activation and benchmark gains from data contamination are separable phenomena that can coexist, so a headline score increase doesn't by itself prove the boundary moved (Can genuine reasoning activation coexist with contaminated benchmarks?).
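One way to read "KL control" and "periodic policy resetting" together is a KL-penalized reward whose reference model is occasionally replaced by a snapshot of the current policy. The sketch below follows that reading only; the notes do not specify the penalty form, so the sampled log-ratio estimate, the `beta` value, and the interpretation of resetting as a reference swap are all assumptions, and the log-probabilities are invented.

```python
def kl_penalized_reward(
    task_reward: float,
    logp_policy: list[float],
    logp_ref: list[float],
    beta: float = 0.1,
) -> float:
    """Subtract a sampled KL estimate (sum of per-token log-ratios) from the reward."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must cover the same tokens")
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return task_reward - beta * kl_estimate


# Hypothetical per-token log-probs for one sampled response.
policy_logps = [-0.5, -1.0]
old_ref_logps = [-1.0, -1.5]  # policy has drifted away from this reference

drifted = kl_penalized_reward(1.0, policy_logps, old_ref_logps)

# Assumed meaning of a reset: the reference becomes a copy of the current
# policy, so the penalty restarts from zero and exploration can continue.
after_reset = kl_penalized_reward(1.0, policy_logps, list(policy_logps))

print(drifted, after_reset)
```

The design intuition under these assumptions: the penalty keeps long training runs from drifting arbitrarily, while the reset stops an old reference from pinning the policy in place once it has legitimately moved.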
If you want to chase the alternatives the corpus hints at: distillation genuinely transfers *new* reasoning patterns where RLVR doesn't (Does RLVR actually expand what models can reason about?); verifier-free methods like VeriFree extend the approach to general domains by using reference-answer likelihood instead of rule-based checking (Can reasoning improvement work without answer verification?); and scaling reasoning in *width*, by sampling parallel latent trajectories rather than only deeper chains, sidesteps the question of boundary expansion entirely by covering more of the solution space at once (Can reasoning systems scale wider instead of only deeper?).
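The reference-answer-likelihood idea can be sketched in a few lines. The version below assumes you already have the policy's per-token log-probabilities for the reference answer, conditioned on the question and a generated reasoning trace; how those are obtained is not shown, the function name is hypothetical, and the numbers are invented. It is a minimal illustration of the reward signal, not VeriFree's actual implementation.

```python
import math


def reference_answer_likelihood(answer_token_logprobs: list[float]) -> float:
    """p(reference answer | question, trace) as the product of token probabilities."""
    return math.exp(sum(answer_token_logprobs))


# Hypothetical log-probs of the same two-token reference answer
# after two different reasoning traces.
helpful_trace = [math.log(0.9), math.log(0.5)]   # trace makes the answer likely
unhelpful_trace = [math.log(0.2), math.log(0.1)]  # trace leads elsewhere

reward_helpful = reference_answer_likelihood(helpful_trace)
reward_unhelpful = reference_answer_likelihood(unhelpful_trace)

print(reward_helpful, reward_unhelpful)
```

No rule-based checker appears anywhere: a trace is rewarded to the degree that it makes the known answer probable, which is what lets the approach apply in domains where answers are hard to verify mechanically.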
Sources 11 notes
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL training.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.