SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and Return-Conditioned RL (RC-RL) perform similarly on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, such as the pretrained model itself, sets the real limits.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

"Teaching Large Language Models to Reason with RL" tests Expert Iteration, PPO, and Return-Conditioned RL across multiple model sizes and initialization conditions with both sparse and dense rewards. Result: performance differences across algorithms are small and convergence behavior is similar. More strikingly, RL training does not improve pass@n scores beyond what light supervised fine-tuning achieves with the same rollout budget.

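pass@n asks whether at least one of n sampled solutions is correct, so it measures what the sampling distribution can reach rather than what it produces most often. The note does not say how the paper computes it. The sketch below uses the unbiased combinatorial estimator common in code-generation evaluation, and the function name and example counts are illustrative assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct solution at all.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem solved in 3 of 96 rollouts: rare under single sampling,
# yet counted as covered once all 96 rollouts are considered.
single_sample_score = pass_at_k(96, 3, 1)
full_budget_score = pass_at_k(96, 3, 96)
```

Under this reading, a training method can raise the single-sample score by moving probability mass onto solutions that were already reachable while leaving the full-budget score unchanged.
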
The mechanism: LLMs require a pretrained prior to navigate the high-dimensional text action space — without it, exploration would be computationally impossible. But this prior simultaneously constrains what gets explored. The model generates variations on what it already knows rather than discovering genuinely novel solutions. Regardless of which RL algorithm manages the update step, the same pretrained exploration prior shapes the solution distribution at convergence.

Additional SFT before RL makes this worse. More SFT concentrates the prior further: the model converges faster on familiar patterns, so RL exploration from that starting point is more constrained, not less. The chain: more SFT → tighter prior → smaller effective exploration space → RL finds less.
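
The tightening effect can be illustrated with a toy model. This is a sketch under loud assumptions: the prior is a small categorical distribution over hypothetical solution modes, and "more SFT" is modelled as raising probabilities to a power and renormalizing. Neither choice comes from the source paper.

```python
import math

def sharpen(probs, power):
    """Raise each probability to `power` (> 1 concentrates mass) and renormalize."""
    weights = [p ** power for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hit_probability(p, n_rollouts):
    """Chance that a mode with probability p is sampled at least once in n rollouts."""
    return 1.0 - (1.0 - p) ** n_rollouts

# Hypothetical solution modes; the last entry is a rare but valid strategy.
prior = [0.70, 0.20, 0.08, 0.02]
tight = sharpen(prior, 2.0)  # assumed stand-in for additional SFT

rollouts = 64  # illustrative rollout budget
reach_before = hit_probability(prior[-1], rollouts)
reach_after = hit_probability(tight[-1], rollouts)
```

In this toy setting the sharpened prior has lower entropy, and the rare mode is much less likely to appear in a fixed rollout budget, so an RL step that can only reinforce sampled solutions has less to work with.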

This reframes what RL training does in practice: it is primarily a selection mechanism, not a discovery mechanism. RL identifies which solutions already present in the pretrained distribution deserve reward. It rarely discovers solutions outside that distribution. The pretrained model contains most of what RL training will eventually "find."
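
A minimal sketch of the selection-versus-discovery distinction, assuming a categorical policy over four hypothetical solutions and an exponentiated-reward update. The update rule is a generic stand-in chosen for clarity. It is not PPO, Expert Iteration, or RC-RL as implemented in the paper.

```python
import math

def reward_weighted_step(policy, reward, beta=1.0):
    """One update where p'(s) is proportional to p(s) * exp(beta * r(s))."""
    weights = {s: p * math.exp(beta * reward[s]) for s, p in policy.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Hypothetical prior: C is correct but rare, D is correct but absent from the prior.
policy = {"A": 0.60, "B": 0.35, "C": 0.05, "D": 0.0}
reward = {"A": 0.0, "B": 0.0, "C": 1.0, "D": 1.0}

for _ in range(20):
    policy = reward_weighted_step(policy, reward, beta=1.0)

# Mass has moved onto C, which the prior already contained.
# D is equally rewarded but stays at zero: the update only reweights the support.
selected = max(policy, key=policy.get)
```

Real language models assign nonzero probability to almost any sequence, so the zero here is an idealization. The practical analogue is a solution so improbable under the prior that it never appears within the rollout budget.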

Connects to "Does policy entropy collapse limit reasoning performance in RL?": the paper's algorithm-invariance result supports the view that entropy is the fundamental constraint. Connects to "Do base models already contain hidden reasoning ability?": if RL is unlocking pre-existing capability rather than building new capability, the algorithm doing the unlocking is interchangeable.

Reweave 2026-05-18 — interchangeability now visible at three levels, not one. When this note was written, the claim was about algorithm interchangeability — PPO, Expert Iteration, RC-RL produce similar results because the prior dominates. Late-2025 evidence shows the same interchangeability holds at two additional levels:

  1. Algorithm level (original claim): PPO ≈ Expert Iteration ≈ RC-RL.
  2. Algorithmic refinement level ("Can two simple techniques match complex RL algorithms?"): vanilla PPO plus two techniques matches GRPO and DAPO. The "zoo of algorithms" (GRPO, DAPO, GPPO, GFPO) collapses to two load-bearing techniques.
  3. Reward-signal source level ("Can language models replace reward models with internal signals?"): the source of the reward signal is also substitutable. SERL self-judgment, ΔBelief-RL internal signal, SDPO rich-feedback distillation, POLAR similarity-to-target, RARO adversarial IRL, and VeriFree reference-likelihood all achieve similar gains.

The meta-claim sharpens: what is interchangeable in RL-for-reasoning is the entire optimization machinery (algorithm choice, algorithmic refinements, and reward-signal source). The non-interchangeable variable is the pretrained prior. This is consistent with the "RL as catalyst, not teacher" framing in "Why do random rewards improve reasoning for some models but not others?": when the prior contains the structure, almost any optimization pressure surfaces it.

The implication is structural rather than tactical: research effort on RL algorithm/refinement/reward-signal innovation has diminishing returns relative to effort on what gets baked into pretraining. The pretrained model contains most of what any RL pipeline will eventually find.

Original note title

rl for reasoning algorithm choice is interchangeable because the exploration ceiling is set by the pretrained prior not the algorithm