How does prolonged RL training differ from standard RLVR approaches?
This asks whether training a model with RL for a long stretch behaves differently from a standard, shorter RLVR (Reinforcement Learning from Verifiable Rewards) run. The honest answer is that the corpus speaks more to what RLVR does over time than to a clean 'prolonged vs. standard' split.
The question is read here as asking whether extending RL training changes its fundamental character or just gives you more of the same. No paper in the corpus isolates 'prolonged' training as its own variable, but several notes together suggest that the interesting effects of RL show up early and then *compound* rather than transform, which is itself a useful answer.
The most striking finding is how fast RL's signature effects appear. Format diversity collapses within the first epoch: RL amplifies a single dominant format inherited from pretraining and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (Does RL training collapse format diversity in pretrained models?). So 'more training' isn't sampling a wider space; it's deepening a commitment made almost immediately. Mechanically, RL touches only a small, structurally consistent slice of the network, 5 to 30 percent of parameters and nearly identical across random seeds, and appears to work mostly by suppressing wrong trajectories rather than amplifying correct ones (Does reinforcement learning update only a small fraction of parameters?; What actually changes inside a model during RL training?). Training follows a predictable two-phase arc (consolidate procedure, then explore strategy), which hints that duration matters in *kind*: early and late training do different things, even if no note measures the far tail directly.
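The sparsity claim above is easier to picture with a small measurement sketch. The snippet below is only an illustration, not the method used in the cited notes: it treats a checkpoint as a flat list of floats, counts the weights that moved, and compares which positions moved under two seeds. The helper names, the `tol` threshold, and the toy weights are all hypothetical.

```python
def changed_indices(before, after, tol=0.0):
    """Positions whose value moved by more than tol between two checkpoints."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (b, a) in enumerate(zip(before, after)) if abs(a - b) > tol}


def update_fraction(before, after, tol=0.0):
    """Share of parameters that changed at all (the 'update sparsity' figure)."""
    return len(changed_indices(before, after, tol)) / len(before)


def jaccard(a, b):
    """Overlap of two index sets: 1.0 means the same parameters were updated."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical toy checkpoint: ten weights before RL.
base = [0.10, -0.25, 0.00, 0.40, 0.33, -0.10, 0.75, 0.05, -0.60, 0.20]

# Two made-up RL runs (different seeds) that happen to move the same two
# weights by different amounts. This is constructed data, not a result.
seed_a = list(base)
seed_a[3] += 0.01
seed_a[8] -= 0.02

seed_b = list(base)
seed_b[3] += 0.03
seed_b[8] -= 0.01

frac_a = update_fraction(base, seed_a)
overlap = jaccard(changed_indices(base, seed_a), changed_indices(base, seed_b))
print(f"fraction updated (seed A): {frac_a:.2f}")
print(f"cross-seed overlap of updated positions: {overlap:.2f}")
```

On real checkpoints the same two numbers would be computed per tensor and with a tolerance suited to the training precision; the toy values here only show what "sparse and consistent across seeds" would mean operationally.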
The cross-cutting worry about extended RLVR is that running it harder makes its narrowing worse, not its reasoning better. RLVR improves *sampling efficiency*: it concentrates probability on solutions the base model could already reach but doesn't push the boundary of solvable problems outward, and at high sampling budgets (pass@k at large k) the base model actually wins (Does RLVR actually expand what models can reason about?). Its on-policy bias actively shrinks the model's range through what one note calls capability boundary collapse, where exploitation crowds out exploration (Why does RLVR training narrow a model's problem solving ability?). And it polishes the *form* of reasoning faster than the substance: adjacent steps become more locally coherent without the trace becoming a globally valid proof (Does RLVR actually improve mathematical reasoning or just coherence?). Prolong that pressure and you risk a model that's sharper-looking and narrower at once.
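The pass@k comparison behind that claim can be made concrete with a short sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are invented purely to show how a model can win at k = 1 and lose at large k; they are not numbers from any cited note.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k draws (out of n samples,
    c of them correct, drawn without replacement) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples drawn per problem (assumed)

# Hypothetical correct-sample counts on four problems.
# The "base" model solves every problem, but rarely.
# The "rlvr" model solves two problems reliably and two never.
BASE = [10, 5, 2, 1]
RLVR = [90, 80, 0, 0]

for k in (1, 10, 100):
    base_score = mean_pass_at_k(BASE, N, k)
    rlvr_score = mean_pass_at_k(RLVR, N, k)
    print(f"k={k:>3}  base={base_score:.3f}  rlvr={rlvr_score:.3f}")
```

In this toy setup the RLVR-style model is far ahead at k = 1, while the base-style model overtakes it once k is large enough to reach its rare successes. That is the shape of the sampling-efficiency versus coverage trade-off described above.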
The corpus gestures in two places at what genuinely *different* extended training would look like. One is curriculum: running Supervised RL (an imitation-style phase) first to build reasoning foundations, then RLVR to sharpen them against verifiable rewards, beats either method alone, because the imitation phase produces reasonable rollouts that make the later outcome reward informative instead of sparse (Does sequencing imitation then exploration training improve reasoning?). The other is scheduling across task types: training order reshapes entropy, and front-loading structured tasks prevents entropy collapse from wrecking open-ended skills later (Does training order reshape how models handle different task types?). Both say the same thing: over a long run, *sequence and signal quality* matter more than raw duration.
The deeper reframe worth taking away: RLVR may not be teaching at all. Spurious or even random rewards still improve some models, because the reward acts as a catalyst that surfaces latent behavior baked in during pretraining (Why do random rewards improve reasoning for some models but not others?; How does RL training reshape reasoning and what gets lost?). If that's right, then 'prolonged RL' can't out-train the pretrained prior no matter how long it runs, which would explain why attention is shifting toward better curricula, exploration-preserving objectives, and infrastructure like fully asynchronous training that makes long multi-turn runs practical in the first place (Can RL training run while generation continues without waiting?). Worth knowing too: training on nearly impossible problems backfires, teaching degenerate shortcuts that contaminate skills the model already had (Do overly hard RLVR samples actually harm model capabilities?).
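The last point has a simple arithmetic core, which the source note attributes to group-relative normalization: when almost every rollout in a group fails, the one accidental success receives an outsized advantage. The sketch below assumes a GRPO-style advantage of (reward - group mean) / (group standard deviation + eps), using the population standard deviation; real implementations differ in such details, and the group size and rewards here are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of eight rollouts with binary verifiable rewards.
rare = [1.0] + [0.0] * 7          # near-impossible problem: one lucky success
balanced = [1.0] * 4 + [0.0] * 4  # moderately hard problem: half succeed

rare_adv = group_relative_advantages(rare)
balanced_adv = group_relative_advantages(balanced)

print(f"advantage of the lone success:     {rare_adv[0]:.2f}")
print(f"advantage of a success at 4 of 8:  {balanced_adv[0]:.2f}")
```

Under these assumptions the lone success is weighted well over twice as heavily as a success in the balanced group. If that success came from a shortcut such as repeating an answer or skipping computation, the shortcut is what gets reinforced.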
Sources (12 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Qwen2.5-Math improves 16-25% on MATH-500 when trained with random or incorrect rewards, because those rewards activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
AReaL enables continuous generation across workers while training runs on mixed model versions using modified PPO. The system achieves high GPU utilization and handles stale samples effectively, making multi-turn RL practical.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.