SYNTHESIS NOTE

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This note explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

The RL-for-LLM-reasoning field has produced a zoo of algorithms (GRPO, DAPO, GPPO, GFPO), each adding techniques atop PPO: clip-higher, dynamic sampling, overlong filtering, difficulty masking, KL loss, SFT loss. But which techniques actually matter? Evaluating each technique in isolation within a unified framework shows that most of them are sensitive to the experimental setup: whether a technique helps depends on model type, data distribution, reward mechanism, and hyperparameters.

The key finding: employing only two techniques — advantage normalization (group-level mean, batch-level standard deviation) and token-level loss aggregation — unlocks the learning capability of critic-free policies using vanilla PPO loss. This minimalist combination consistently improves performance, surpassing strategies like GRPO and DAPO that incorporate many additional components.
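The normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluated implementation: the function name `normalize_advantages`, the nested-list layout (one inner list of scalar rewards per prompt), the `eps` stabilizer, and the sample rewards are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def normalize_advantages(rewards_by_group, eps=1e-6):
    """Group-level mean, batch-level standard deviation.

    rewards_by_group: list of lists; each inner list holds the scalar
    rewards of the responses sampled for one prompt (hypothetical layout).
    """
    # Batch-level std: pooled over every response in the batch.
    all_rewards = [r for group in rewards_by_group for r in group]
    batch_std = pstdev(all_rewards)

    advantages = []
    for group in rewards_by_group:
        # Group-level mean: the baseline is local to the prompt.
        group_mean = fmean(group)
        advantages.append(
            [(r - group_mean) / (batch_std + eps) for r in group]
        )
    return advantages


# Hypothetical binary rewards for two prompts, four samples each.
rewards = [
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes
    [1.0, 1.0, 1.0, 0.0],  # mostly solved
]
adv = normalize_advantages(rewards)
```

Each group's advantages are centered on that prompt's own mean, while the scale comes from the whole batch. One plausible reading of why this is robust: a group whose rewards are nearly identical has a tiny group-level std, which would inflate its advantages, whereas the pooled std keeps the scale comparable across prompts.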

Specific technique findings: (1) group-level normalization shows robust efficiency across reward settings; (2) batch-level normalization provides more stable improvement at larger reward scales; (3) combining group-level mean with batch-level std enables robust normalization; (4) token-level aggregation is effective on base models but shows limited improvement on already-aligned models; (5) overlong filtering helps short-to-medium reasoning but not long-tail reasoning.
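To make finding (4) concrete, the sketch below contrasts sequence-level and token-level loss aggregation on made-up per-token losses. The function names and the numbers are hypothetical; only the two averaging schemes themselves are the point.

```python
def sequence_level_loss(token_losses):
    """Average within each response first, then across responses."""
    per_seq = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_seq) / len(per_seq)


def token_level_loss(token_losses):
    """Average over every token in the batch at once."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


# Hypothetical per-token losses: one short response, one long response.
batch = [[1.0, 1.0], [0.5] * 8]

seq_loss = sequence_level_loss(batch)  # (1.0 + 0.5) / 2
tok_loss = token_level_loss(batch)     # (2.0 + 4.0) / 10
```

Under sequence-level averaging, each token of the 2-token response carries weight 1/4 and each token of the 8-token response carries weight 1/16, so tokens in long responses are down-weighted. Under token-level averaging every token carries the same weight, 1/10.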

This strongly reinforces the existing insight from the related note "Does the choice of RL algorithm actually matter for reasoning?". If vanilla PPO with two techniques matches or surpasses GRPO and DAPO, then the algorithmic innovation in the current RL-for-reasoning literature is largely engineering optimization, not fundamental capability improvement. The pretrained prior determines what's achievable; the algorithm determines how efficiently you get there, with diminishing returns from complexity.

GPPO complicates the minimalist story for one specific reason — gradient discarding. Klear-Reasoner (2508.07629) identifies a specific failure mode of standard clipping that the two-technique minimalist set does not address: high-entropy token clipping suppresses critical exploration signals at decision points, and clipped suboptimal trajectories lose their gradient contribution entirely, slowing convergence. Gradient-Preserving Policy Optimization (GPPO) keeps clipped tokens in the backpropagation graph with bounded, mild gradients — preserving exploration signal that Clip-Higher still suppresses. This is the one specific dimension where the minimalist combination genuinely undertrains, and the fix is structural (gradient flow, not advantage shape). The minimalist thesis remains correct for the advantage estimation axis; GPPO is an orthogonal fix on the gradient flow axis.
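The gradient-discarding issue can be illustrated by looking at the coefficient that multiplies the gradient of the token log-probability. The sketch below is an assumption-laden illustration of the general idea described above (clipped tokens keep a bounded gradient), not the exact GPPO formulation from the paper. In particular, using the clip boundary itself as the bound, the function names, and the default epsilons are all choices made for this example.

```python
def ppo_clip_grad_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad(log pi) for one token under standard PPO clipping.

    The clipped branch of min(r*A, clip(r)*A) is constant in the policy,
    so the token contributes no gradient once it is clipped.
    """
    if adv > 0 and ratio > 1 + eps_high:
        return 0.0
    if adv < 0 and ratio < 1 - eps_low:
        return 0.0
    return ratio * adv


def gradient_preserving_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Assumed gradient-preserving variant: clipped tokens keep a gradient
    whose magnitude is capped at the clip boundary instead of dropping to 0."""
    if adv > 0 and ratio > 1 + eps_high:
        return (1 + eps_high) * adv
    if adv < 0 and ratio < 1 - eps_low:
        return (1 - eps_low) * adv
    return ratio * adv


# Hypothetical (ratio, advantage) pairs for three tokens.
tokens = [(1.5, 1.0), (0.5, -1.0), (1.0, 1.0)]
standard = [ppo_clip_grad_coeff(r, a) for r, a in tokens]
preserved = [gradient_preserving_coeff(r, a) for r, a in tokens]
```

For the first two tokens the standard coefficient is zero, so they drop out of the update entirely; in the preserving variant they still contribute, but with a magnitude bounded by the clip range. The unclipped third token is treated identically by both.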

Related concepts in this collection 5

Does the choice of RL algorithm actually matter for reasoning?
Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
directly supports: even simpler than expected; two techniques suffice

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
connects: advantage normalization and token-level loss may work precisely because they manage entropy dynamics

Does RL training collapse format diversity in pretrained models?
Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
extends: the format convergence may be inevitable regardless of algorithm, which is why algorithm choice doesn't matter much

Can RL training run while generation continues without waiting?
Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
complementary PPO simplification: AReaL modifies PPO for staleness tolerance in asynchronous training; this note shows PPO needs only two techniques for reasoning performance. Together they suggest the PPO framework is more robust and adaptable than the proliferation of replacement algorithms implies

How should multiple reward objectives be weighted during training?
When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
extends: DVAO adds a variance-adaptive weighting layer atop advantage-normalization machinery

Original note title

two techniques unlock critic-free ppo matching grpo and dapo — advantage normalization and token-level loss aggregation
