Can RL format selection explain performance gains attributed to algorithmic improvements?
This explores whether the gains we credit to clever RL algorithms might really come from RL simply locking onto one output format that the base model already knew — making the algorithm's contribution smaller than it looks.
This reads the question as a challenge to attribution: when an RL method posts better scores, does the improvement come from the algorithm's machinery, or from RL quietly selecting a winning format that pretraining already contained? The corpus gives this idea real support. Controlled experiments show RL converging on a single dominant pretraining format within the first epoch while suppressing the alternatives. Strikingly, which format wins depends on model scale and not necessarily on which format performs best (Does RL training collapse format diversity in pretrained models?). If the model is mostly being steered toward a format it could already produce, then much of what looks like "the algorithm taught it to reason" is really "the algorithm picked a lane."
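One way to make "format collapse" measurable is to label each sampled output with its format and track how concentrated the label distribution becomes. The sketch below is illustrative only: the format labels and the before/after samples are made-up assumptions, not data from the cited experiments, and Shannon entropy is just one reasonable concentration measure.

```python
from collections import Counter
from math import log2


def format_shares(labels):
    """Return the fraction of samples carrying each format label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}


def format_entropy(labels):
    """Shannon entropy in bits of the format distribution (0 = a single format)."""
    return -sum(p * log2(p) for p in format_shares(labels).values())


# Hypothetical format labels for sampled outputs (assumed, for illustration).
before_rl = ["code", "prose", "latex", "code", "prose", "latex"]
after_rl = ["code", "code", "code", "code", "code", "prose"]

print("before:", format_shares(before_rl), format_entropy(before_rl))
print("after: ", format_shares(after_rl), format_entropy(after_rl))
```

A sharp entropy drop early in training, with one label absorbing most of the mass, is the pattern the note describes. Checking whether the winning label is also the best-scoring one would need per-format accuracy as well.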
A second line of evidence makes the same point from the parameter side. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, and those sparse updates are nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). That seed-invariance is the tell: if every algorithm tested touches only a small subnetwork, and reruns with different seeds land on nearly the same one, then the model's pretrained structure, not the particular algorithm, is doing the steering. The algorithm is finding a pre-existing channel, not building a new one.
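The two quantities in this argument, the fraction of parameters updated and how much the updated sets agree across seeds, can be sketched as follows. This is a minimal illustration under stated assumptions: weights are treated as flat NumPy arrays, the toy data is synthetic, and Jaccard overlap is an assumed stand-in for whatever overlap measure the cited study used.

```python
import numpy as np


def update_mask(base, tuned, tol=0.0):
    """Boolean mask of parameters whose value changed by more than tol."""
    return np.abs(tuned - base) > tol


def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters changed between the base and tuned weights."""
    return float(update_mask(base, tuned, tol).mean())


def mask_overlap(mask_a, mask_b):
    """Jaccard overlap of two update masks (1.0 = identical subnetworks)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 1.0  # neither run changed anything, so the masks trivially agree
    return float(np.logical_and(mask_a, mask_b).sum() / union)


# Synthetic stand-in for flattened weights (assumed data, not real checkpoints).
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
touched = np.zeros(1000, dtype=bool)
touched[:200] = True  # both toy "seeds" update the same 20% of parameters

seed_a = base.copy()
seed_a[touched] += 0.01
seed_b = base.copy()
seed_b[touched] -= 0.02

frac_a = update_fraction(base, seed_a)
overlap = mask_overlap(update_mask(base, seed_a), update_mask(base, seed_b))
print(frac_a, overlap)
```

With real checkpoints, a small nonzero `tol` may be needed to ignore numerical noise, and the comparison would normally be done per tensor.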
What is RL actually changing, then? Several notes suggest the answer is surprisingly little of the underlying capability. RL fine-tuning sharpens memorization rather than installing genuine reasoning procedures: even GRPO-trained models drop sharply on out-of-distribution variants of problems they ace in-distribution (Do fine-tuned language models actually learn optimization procedures?). And headline RLVR gains on math benchmarks owe a great deal to contamination and recall: Qwen2.5-Math-7B can reconstruct 54.6% of MATH-500 from partial prompts yet scores 0.0% on the post-release LiveMathBench (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So the gains attributed to algorithmic cleverness are partly format selection and partly memorization resurfacing, not new reasoning.
The corpus also shows where format selection turns actively harmful, which is itself evidence that format, not reasoning, is the lever being pulled. Overly hard RLVR samples push models toward degenerate shortcut formats such as answer repetition and computation-skipping, because group-relative normalization treats rare lucky successes as high-advantage trajectories worth imitating (Do overly hard RLVR samples actually harm model capabilities?). Binary rewards similarly reshape output style toward confident guessing, degrading calibration unless a scoring-rule term such as the Brier score is added to the reward (Does binary reward training hurt model calibration?). In both cases RL is molding the shape of the answer, not the competence behind it.
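Both mechanisms come down to simple arithmetic, sketched below. The function names and the `eps` constant are illustrative assumptions. The advantage formula is the common standardize-within-group form, and the calibrated reward assumes the Brier term enters as a subtracted squared error between stated confidence and correctness. Neither is taken from the cited notes' code.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def calibrated_reward(correct, confidence):
    """Binary correctness minus a Brier penalty, (confidence - correct)^2."""
    y = float(correct)
    return y - (confidence - y) ** 2


# One lucky success among eight rollouts on a near-impossible problem,
# versus a moderately hard problem solved half the time.
hard = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])
moderate = group_relative_advantages([1, 0, 1, 0, 1, 0, 1, 0])
print(hard[-1], moderate[0])  # the lone success gets a much larger advantage

# Binary reward alone scores both wrong answers 0; the Brier term separates them.
print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.5))
```

In the hard group the single success is pushed far above its peers, so whatever shortcut produced it is reinforced strongly. In the second comparison, a confident wrong answer is penalized more than a hesitant one, which removes the incentive to guess confidently.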
The honest synthesis: the corpus supports a strong "often, yes" rather than an absolute one. Format selection and memorization-sharpening plausibly explain a large share of reported gains, which is why benchmarks that share format or contamination with training data overstate the algorithm's contribution. But that isn't the whole story: real, hard-to-fake differences show up when training order is managed to prevent entropy collapse (Does training order reshape how models handle different task types?). The takeaway you might not have expected: the cleanest test of whether an RL gain is "algorithmic" is whether it survives both a format shift and a clean, uncontaminated benchmark, and a lot of celebrated gains do not.
Sources (7 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.