How do reward signals in RLVR interact with pretraining biases?
This explores whether the reward in RLVR (reinforcement learning with verifiable rewards) is what actually teaches a model to reason, or whether it just surfaces patterns the model already absorbed during pretraining — and how much the reward signal itself even matters.
The corpus points hard toward the second reading: the reward mostly surfaces what pretraining already laid down rather than teaching reasoning. The most striking evidence is that, for some models, RLVR works almost as well with random or even wrong rewards as with correct ones. Qwen2.5-Math gains 16–25% on MATH-500 from spurious rewards, while Llama and OLMo get nothing from the same treatment (see "Why do random rewards improve reasoning for some models but not others?"). The reward isn't injecting a skill; it's flipping a switch on latent code-reasoning behavior that Qwen's pretraining happened to install and the others lack. Several notes frame this the same way: verifiable rewards act as catalysts that surface existing capabilities, not teachers that build new ones (see "What does reward learning actually do to model reasoning?" and "How does RL training reshape reasoning and what gets lost?"), and effectiveness tracks pretraining quality rather than reward correctness or training volume (see "Why does RLVR work with completely random rewards?").
If the reward is mostly a catalyst, the natural question is what it catalyzes. The answer is that it amplifies one pretraining bias at the expense of others. Controlled experiments show RL converges on a single dominant output format from the pretraining distribution within the first epoch, collapsing the alternatives. Tellingly, the format that wins depends on model scale rather than on which format performs best, and this dynamic is invisible when you start from a proprietary base model whose priors you can't see (see "Does RL training collapse format diversity in pretrained models?"). So the reward signal is less an external teacher than a selection pressure operating on a fixed menu the model brought with it.
This also explains why RLVR doesn't expand what a model can do. Pass@k analysis shows base models actually beat their RLVR-tuned versions at high k: RLVR narrows sampling toward solutions already living in the base distribution rather than adding new ones, while distillation (importing another model's reasoning) genuinely transfers new patterns (see "Does RLVR actually expand what models can reason about?"). The mechanism shows up even at the parameter level: RL changes only 5–30% of weights, and those sparse updates are nearly full-rank and almost identical across random seeds. That pattern points to structural, prior-bounded updates rather than wholesale relearning (see "Does reinforcement learning update only a small fraction of parameters?").
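To make the pass@k comparison concrete, here is a minimal sketch using the standard unbiased pass@k estimator (n samples per problem, c of them correct). The per-problem counts are hypothetical numbers chosen only to illustrate the crossover described above; they are not results from any of the cited notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL correct counts out of n = 64 samples on four problems.
N_SAMPLES = 64
base_counts = [8, 8, 2, 1]    # base model: rarely right, but solves everything sometimes
rlvr_counts = [48, 48, 0, 0]  # tuned model: sharp on two problems, never solves the rest

if __name__ == "__main__":
    for k in (1, 8, 64):
        base = mean_pass_at_k(base_counts, N_SAMPLES, k)
        rlvr = mean_pass_at_k(rlvr_counts, N_SAMPLES, k)
        print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

In this toy setup the tuned model wins at k = 1 and the base model wins at k = 64, which is the shape of the result the pass@k analysis reports: better sampling efficiency, no larger set of solvable problems.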
The interaction has a darker side worth knowing: because the reward only reshapes existing tendencies, a badly designed signal can corrupt pretrained capability instead of refining it. Overly hard problems push models toward degenerate shortcuts such as answer repetition and computation-skipping, and group-relative normalization treats rare lucky successes as high-advantage, reinforcing the shortcuts until they contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). Binary correctness rewards similarly degrade calibration by rewarding confident guessing, which is fixable by adding a Brier-score term (see "Does binary reward training hurt model calibration?"). And the polarity of the signal matters more than people assume: negative-only reinforcement (suppressing wrong trajectories) often matches full PPO/GRPO while preserving the diversity that positive-only reinforcement destroys by over-concentrating probability mass (see "Does negative reinforcement alone outperform full reinforcement learning?").
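Two of the mechanisms above are easy to see numerically. The sketch below assumes a GRPO-style advantage (reward minus group mean, divided by group standard deviation) and a calibration reward of the form correctness minus the squared error of a stated confidence. Both function names, the use of the population standard deviation, and the exact reward form are illustrative assumptions, not the implementation from any cited note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one rollout group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Binary correctness minus a Brier penalty on the stated confidence (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# Near-impossible problem: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Moderate problem: 8 successes in 16 rollouts.
mid_group = [1.0] * 8 + [0.0] * 8

if __name__ == "__main__":
    print("lucky success advantage:", round(group_relative_advantages(hard_group)[0], 2))
    print("ordinary success advantage:", round(group_relative_advantages(mid_group)[0], 2))
    # A plain binary reward gives 0 to both wrong answers below.
    print("confident wrong:", brier_augmented_reward(False, 0.95))
    print("hedged wrong:", brier_augmented_reward(False, 0.20))
    print("confident right:", brier_augmented_reward(True, 0.95))
```

Under these assumptions the single lucky success on the hard problem receives several times the advantage of a success on the moderate problem, so whatever shortcut produced it is reinforced strongly. The Brier term, by contrast, separates a confident wrong answer from a hedged one, which a binary reward cannot do.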
The quietly useful takeaway: if you want RLVR to add capability rather than just sharpen what pretraining gave you, the reward can't do it alone. Sequencing imitation first (supervised RL to build reasonable rollouts) and then RLVR to sharpen them beats either alone, because imitation creates the trajectories that make the outcome reward informative in the first place (see "Does sequencing imitation then exploration training improve reasoning?"). RL training even self-organizes into a two-phase arc, mastering execution before strategic planning becomes the bottleneck (see "Does RL training follow a predictable two-phase learning sequence?"). The reward signal, in other words, is a lever, but it only moves what pretraining already put within reach.
Sources (12 notes)
Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.