SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This note explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

The RL-for-LLM-reasoning field has produced a zoo of algorithms (GRPO, DAPO, GPPO, GFPO), each adding techniques atop PPO: clip-higher, dynamic sampling, overlong filtering, difficulty masking, KL loss, SFT loss. But which techniques actually matter? Systematic evaluation of each technique in isolation, within a unified framework, reveals that most of them are clearly sensitive to the experimental setup: model type, data distribution, reward mechanism, and hyperparameters.

The key finding: employing only two techniques — advantage normalization (group-level mean, batch-level standard deviation) and token-level loss aggregation — unlocks the learning capability of critic-free policies using vanilla PPO loss. This minimalist combination consistently improves performance, surpassing strategies like GRPO and DAPO that incorporate many additional components.
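As a minimal sketch of the normalization described above, the following centres each reward on its own prompt group's mean and divides by a single standard deviation taken over the whole batch. The function name `normalize_advantages`, the `eps` stabilizer, and the choice to compute the batch standard deviation over raw rewards are assumptions made for illustration, not details confirmed by the source.

```python
from statistics import fmean, pstdev


def normalize_advantages(rewards_by_group, eps=1e-6):
    """Hypothetical helper: group-level mean, batch-level std.

    rewards_by_group: list of lists, one inner list of scalar rewards
    per prompt (one reward per sampled response).
    """
    # Group-level mean: centre each reward on its own prompt's mean.
    centred = [[r - fmean(group) for r in group] for group in rewards_by_group]

    # Batch-level std (assumption: taken over all raw rewards in the batch).
    all_rewards = [r for group in rewards_by_group for r in group]
    batch_std = pstdev(all_rewards)

    # eps keeps the division finite if every reward in the batch is identical.
    return [[c / (batch_std + eps) for c in group] for group in centred]


# Two prompts, two responses each; the first prompt is solved by both samples.
advantages = normalize_advantages([[1.0, 1.0], [0.0, 1.0]])
```

In this example the first group has zero within-group variance, so a purely group-level standard deviation would divide by (nearly) zero. Using the batch-level standard deviation keeps its advantages at exactly zero while the second group still receives a usable signal.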

Specific technique findings: (1) group-level normalization shows robust efficiency across reward settings; (2) batch-level normalization provides more stable improvement at larger reward scales; (3) combining group-level mean with batch-level std enables robust normalization; (4) token-level aggregation is effective on base models but shows limited improvement on already-aligned models; (5) overlong filtering helps short-to-medium reasoning but not long-tail reasoning.
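To make the aggregation choice in finding (4) concrete, the sketch below contrasts token-level aggregation with the sequence-level alternative on per-token loss values. The function names and the toy numbers are illustrative assumptions; the source does not specify an implementation.

```python
def sequence_level_loss(token_losses):
    """Average within each response first, then across responses.

    Every response counts equally, regardless of its length.
    """
    per_sequence = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_sequence) / len(per_sequence)


def token_level_loss(token_losses):
    """Average over all tokens in the batch at once.

    Every token counts equally, so longer responses carry more weight.
    """
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


# One short response (2 tokens) and one long response (8 tokens).
batch = [[1.0, 1.0], [0.0] * 8]
seq_loss = sequence_level_loss(batch)  # (1.0 + 0.0) / 2
tok_loss = token_level_loss(batch)     # 2.0 / 10
```

The two schemes disagree whenever response lengths differ: here the short response supplies half of the sequence-level loss but only a fifth of the tokens in the token-level loss.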

This strongly reinforces the existing insight in "Does the choice of RL algorithm actually matter for reasoning?". If vanilla PPO with two techniques matches or surpasses GRPO and DAPO, then the algorithmic innovation in the current RL-for-reasoning literature is largely engineering optimization, not fundamental capability improvement. The pretrained prior determines what's achievable; the algorithm determines how efficiently you get there, with diminishing returns from complexity.

GPPO complicates the minimalist story for one specific reason — gradient discarding. Klear-Reasoner (2508.07629) identifies a specific failure mode of standard clipping that the two-technique minimalist set does not address: high-entropy token clipping suppresses critical exploration signals at decision points, and clipped suboptimal trajectories lose their gradient contribution entirely, slowing convergence. Gradient-Preserving Policy Optimization (GPPO) keeps clipped tokens in the backpropagation graph with bounded, mild gradients — preserving exploration signal that Clip-Higher still suppresses. This is the one specific dimension where the minimalist combination genuinely undertrains, and the fix is structural (gradient flow, not advantage shape). The minimalist thesis remains correct for the advantage estimation axis; GPPO is an orthogonal fix on the gradient flow axis.
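The gradient-discarding issue can be illustrated by looking at the scalar that multiplies the gradient of log π for a single token. The sketch below is one reading of the description above, not the formula from Klear-Reasoner: the function names, the default clip ranges, and the choice to bound clipped tokens at exactly the clip boundary are all assumptions.

```python
def ppo_clip_grad_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Gradient coefficient under the standard clipped PPO surrogate.

    For the objective min(r * A, clip(r) * A), a token outside the trust
    region on the side its advantage pushes toward contributes nothing.
    """
    clipped = (advantage > 0 and ratio > 1 + eps_high) or (
        advantage < 0 and ratio < 1 - eps_low
    )
    return 0.0 if clipped else ratio * advantage


def gradient_preserving_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Illustrative variant: clipped tokens keep a bounded gradient.

    Assumption: the coefficient is held at the clip boundary instead of
    being zeroed, so its magnitude no longer grows with the ratio.
    """
    if advantage > 0 and ratio > 1 + eps_high:
        return (1 + eps_high) * advantage
    if advantage < 0 and ratio < 1 - eps_low:
        return (1 - eps_low) * advantage
    return ratio * advantage


# A positive-advantage token whose probability already rose by 50%.
standard = ppo_clip_grad_weight(1.5, 1.0)         # discarded
preserved = gradient_preserving_weight(1.5, 1.0)  # bounded, non-zero
```

Inside the trust region the two functions agree; they differ only on tokens that standard clipping would drop, which is where the paragraph above locates the lost exploration signal.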

Original note title

two techniques unlock critic-free ppo matching grpo and dapo — advantage normalization and token-level loss aggregation