INQUIRING LINE

Can RL format selection explain performance gains attributed to algorithmic improvements?

This explores whether the gains we credit to clever RL algorithms might really come from RL simply locking onto one output format that the base model already knew — making the algorithm's contribution smaller than it looks.


This reads the question as a challenge to attribution: when an RL method posts better scores, is the improvement from the algorithm's machinery, or from RL quietly selecting a winning format that pretraining already contained? The corpus gives this idea real support. Controlled experiments show RL converges on a single dominant pretraining format within the first epoch while suppressing the alternatives, and, strikingly, which format wins depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). If the model is mostly being steered toward a format it could already produce, then a lot of what looks like "the algorithm taught it to reason" is really "the algorithm picked a lane."

A second line of evidence makes the same point from the parameter side. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, and those sparse updates are nearly identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). That seed-invariance is the tell: if the sparsity shows up under every algorithm tested, and runs with different seeds land on nearly the same small subnetwork, then the model's structure (its pretrained distribution), not the particular algorithm, is doing the steering. The algorithm is finding a pre-existing channel, not building a new one.
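To make the two measurements in this paragraph concrete, here is a minimal sketch of how update sparsity and cross-seed overlap could be computed. The function names, the zero tolerance, the choice of Jaccard overlap, and the toy ten-parameter "checkpoints" are all assumptions for illustration, not the procedure used in the cited note.

```python
def update_mask(before, after, tol=0.0):
    """Mark each parameter whose value moved by more than tol (hypothetical helper)."""
    return [abs(a - b) > tol for b, a in zip(before, after)]


def update_fraction(mask):
    """Fraction of parameters that were updated at all."""
    return sum(mask) / len(mask)


def mask_overlap(mask_a, mask_b):
    """Jaccard overlap of two update masks: 1.0 means the same parameters moved."""
    both = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    either = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    return 1.0 if either == 0 else both / either


# Toy stand-ins for flattened weights (made-up values, not real checkpoints).
base = [float(i) for i in range(10)]

seed_a = list(base)
seed_a[0] += 0.5
seed_a[1] += 0.5

seed_b = list(base)
seed_b[0] += 0.25   # same positions as seed_a, different magnitudes
seed_b[1] += 0.25
seed_b[2] += 0.25   # one seed-specific update

mask_a = update_mask(base, seed_a)
mask_b = update_mask(base, seed_b)

print("updated fraction, seed A:", update_fraction(mask_a))
print("updated fraction, seed B:", update_fraction(mask_b))
print("cross-seed overlap:", mask_overlap(mask_a, mask_b))
```

A low updated fraction combined with a high cross-seed overlap is the pattern the paragraph describes. A low fraction with low overlap would instead suggest that the sparse subset is arbitrary.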

What is RL actually changing, then? Several notes suggest: surprisingly little of the underlying capability. RL fine-tuning sharpens template-matching and memorization rather than installing genuine reasoning procedures: GRPO-trained models drop sharply on out-of-distribution variants of problems they ace in-distribution (see "Do fine-tuned language models actually learn optimization procedures?"). And headline RLVR gains on math benchmarks are substantially contamination and recall: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts yet scores 0.0% on the post-release LiveMathBench (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). So the gains attributed to algorithmic cleverness are partly format selection, partly memorization resurfacing, and not new reasoning.
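The partial-prompt check mentioned above can be sketched roughly as follows. This is an illustrative assumption about how such a probe might be built, not the cited study's protocol: the `complete` callable, the 60% prefix length, the exact-match criterion, and the two toy problems are all hypothetical.

```python
def reconstruction_rate(problems, complete, prefix_frac=0.6):
    """Share of problems whose hidden tail the model reproduces verbatim.

    `complete` is a hypothetical callable mapping a prompt prefix to the
    model's continuation; exact match on the tail is an assumed criterion.
    """
    if not problems:
        return 0.0
    hits = 0
    for text in problems:
        cut = int(len(text) * prefix_frac)
        prefix, remainder = text[:cut], text[cut:]
        if complete(prefix).strip() == remainder.strip():
            hits += 1
    return hits / len(problems)


# Toy benchmark and a stub "model" that has memorized only the first problem.
problems = [
    "What is 2 + 2 equal to?",
    "Find x if 3x = 12.",
]
memorized = {problems[0]}


def stub_complete(prefix):
    """Stand-in for a model call: recite a memorized problem, else say nothing."""
    for text in memorized:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


rate = reconstruction_rate(problems, stub_complete)
print("reconstruction rate:", rate)
```

A high rate on an older benchmark paired with near-zero accuracy on a post-release one is the signature the paragraph reads as contamination and not as reasoning.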

The corpus also shows where format selection turns actively harmful, which is itself evidence that format, not reasoning, is the lever being pulled. Overly hard RLVR samples push models toward degenerate shortcut formats (answer repetition, computation-skipping) because group-relative normalization treats rare lucky successes as high-advantage trajectories worth imitating (see "Do overly hard RLVR samples actually harm model capabilities?"). Binary rewards similarly reshape output style toward confident guessing, degrading calibration unless a scoring-rule term such as the Brier score is added to the reward (see "Does binary reward training hurt model calibration?"). In both cases RL is molding the shape of the answer, not the competence behind it.
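Both mechanisms in this paragraph come down to simple arithmetic, sketched below. The assumptions are: advantages are computed as (reward minus group mean) divided by the group's population standard deviation, and the calibration-aware reward takes the form correctness minus the squared gap between stated confidence and correctness. Actual training setups may differ in the normalizer, the epsilon, and the weighting of the Brier term.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def brier_augmented_reward(correct, confidence):
    """Binary correctness minus a Brier penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# A nearly impossible prompt: one lucky success out of eight rollouts.
hard_group = [0.0] * 7 + [1.0]
# A well-matched prompt: half of the rollouts succeed.
balanced_group = [0.0] * 4 + [1.0] * 4

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

print("advantage of the lone success:", round(hard_adv[-1], 3))
print("advantage of a success in the balanced group:", round(balanced_adv[-1], 3))

# A plain binary reward scores both wrong answers below as 0.
# The Brier term separates the confident miss from the hedged one.
print("confident and wrong:", brier_augmented_reward(False, 1.0))
print("hedged and wrong:", brier_augmented_reward(False, 0.5))
print("confident and right:", brier_augmented_reward(True, 1.0))
```

In this toy setup the lone success receives an advantage more than twice as large as a success in the balanced group, so whatever shortcut produced it is reinforced hard. The second function shows why a proper scoring rule removes the incentive to guess confidently: a confident miss is penalized more heavily than a hedged one.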

The honest synthesis: the corpus supports a strong "often, yes" rather than an unqualified yes. Format selection and memorization-sharpening plausibly explain a large share of reported gains, which is why benchmarks that share format or contamination with training data overstate the algorithm's contribution. But it isn't the whole story: real, hard-to-fake differences show up when training order is managed to prevent entropy collapse, where BWT-guided scheduling yields 6.2% gains over joint training (see "Does training order reshape how models handle different task types?"). The takeaway you might not have expected: the cleanest test of whether an RL gain is "algorithmic" is whether it survives a format shift and a clean, uncontaminated benchmark, and a lot of celebrated gains do not.
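One way to operationalize that closing test is to ask what fraction of the in-format gain is retained under each harder condition. The sketch below is an assumed bookkeeping scheme, not a metric from the corpus, and every accuracy in it is a made-up placeholder.

```python
def gain(base_acc, rl_acc):
    """Accuracy improvement of the RL-tuned model over its base model."""
    return rl_acc - base_acc


def retained_fraction(gain_original, gain_other):
    """Share of the original gain that survives in another condition.

    Returns None when there was no original gain to retain.
    """
    if gain_original <= 0:
        return None
    return max(0.0, gain_other) / gain_original


# Placeholder accuracies for illustration only (not measured results).
results = {
    "original": {"base": 0.50, "rl": 0.75},
    "format_shifted": {"base": 0.50, "rl": 0.5625},
    "clean_benchmark": {"base": 0.25, "rl": 0.25},
}

gains = {name: gain(r["base"], r["rl"]) for name, r in results.items()}
retention = {
    name: retained_fraction(gains["original"], g)
    for name, g in gains.items()
    if name != "original"
}

for name, frac in retention.items():
    print(name, "retains", frac, "of the original gain")
```

Under this framing, a gain that is mostly format selection or recall shows retention near zero, while a gain that reflects a real capability change should stay close to one across both conditions.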


Sources (7 notes)

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-for-LLMs researcher re-testing the claim that format selection, not algorithmic novelty, explains most RL post-training gains. The findings below come from a curated library spanning 2023–2026.

**What a curated library found — and when (dated claims, not current truth):**
• RL converges on a single dominant pretraining format within epoch 1; which format wins depends on model scale, not necessarily on which format performs best (~2025).
• RL updates only 5–30% of parameters in seed-invariant subnetworks across seven algorithms and ten model families (~2025), suggesting pre-structure, not algorithm, steers selection.
• GRPO-trained models drop sharply on OOD variants; headline math gains are substantially benchmark contamination and memorization resurfacing, not reasoning (Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts yet scores 0.0% on post-release LiveMathBench) (~2025–2026).
• Overly hard RLVR samples induce degenerate shortcuts (answer repetition, computation-skipping) because group-relative normalization treats lucky successes as high-advantage trajectories (~2025).
• Binary rewards degrade calibration unless a proper-scoring-rule term is re-added; RL reshapes answer *style*, not underlying competence (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2505.11711 (2025-05) Reinforcement Learning Finetunes Small Subnetworks
- arXiv:2507.10532 (2025-07) Reasoning or Memorization? Data Contamination
- arXiv:2604.15597 (2026-04) LLMs Corrupt Your Documents
- arXiv:2605.28388 (2026-05) Mechanistically Interpreting Sample Difficulty in RLVR

**Your task:**
(1) **Re-test each constraint.** For every finding above, assess whether newer model scales, training curricula (e.g., 2508.08940), multi-task orchestration (2507.14783), or cleaner evaluation harnesses have since relaxed the format-selection bottleneck or the memorization-dominance claim. Separate the durable question ("how much RL gain is real?") from perishable limitations ("current benchmarks are contaminated"). Say plainly where each constraint still holds.
(2) **Surface strongest contradictions.** Find the last ~6 months of work that *most directly opposes* or *supersedes* the claim that format selection explains most gains. What does it show instead?
(3) **Propose two next questions** that assume the attribution regime may have shifted—e.g., "Does curriculum-managed training prevent entropy collapse?" or "Can format-invariant eval isolate true reasoning gains?"

**Guardrail:** Cite arXiv IDs; flag anything ungrounded in a real paper.
