Can two simple techniques match complex RL algorithms?
Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This note explores whether algorithmic complexity is necessary for effective LLM reasoning training.
The RL-for-LLM-reasoning field has produced a zoo of algorithms (GRPO, DAPO, GPPO, GFPO), each adding techniques atop PPO: clip-higher, dynamic sampling, overlong filtering, difficulty masking, KL loss, SFT loss. But which techniques actually matter? Systematic, isolated evaluation within a unified framework reveals that most RL techniques show clear preferences for, and sensitivity to, the experimental setup: model type, data distribution, reward mechanism, and hyperparameters.
The key finding: employing only two techniques — advantage normalization (group-level mean, batch-level standard deviation) and token-level loss aggregation — unlocks the learning capability of critic-free policies using vanilla PPO loss. This minimalist combination consistently improves performance, surpassing strategies like GRPO and DAPO that incorporate many additional components.
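The first of the two techniques can be sketched in a few lines. This is a minimal illustration, not the evaluated implementation: the function name `mixed_advantages`, the `eps` stabilizer, and the choice to take the batch-level standard deviation over the raw rewards of every response in the batch are all assumptions made for the example.

```python
from statistics import mean, pstdev


def mixed_advantages(rewards_by_prompt, eps=1e-6):
    """Group-level mean, batch-level std (hypothetical helper).

    rewards_by_prompt: list of groups; each group holds the scalar rewards
    of the responses sampled for one prompt.
    """
    # Assumption: one standard deviation is computed over all raw rewards in the batch.
    all_rewards = [r for group in rewards_by_prompt for r in group]
    batch_std = pstdev(all_rewards)

    advantages = []
    for group in rewards_by_prompt:
        # Center each reward on the mean of its own group (same prompt).
        group_mean = mean(group)
        advantages.append([(r - group_mean) / (batch_std + eps) for r in group])
    return advantages


# Toy batch: two prompts, four sampled responses each, binary rewards.
rewards = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
advantages = mixed_advantages(rewards)
```

In this toy batch the second prompt's responses all earn the same reward, so their advantages are zero, while the first prompt's responses are scaled by a spread measured across both prompts rather than within the group alone.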
Specific technique findings: (1) group-level normalization shows robust efficiency across reward settings; (2) batch-level normalization provides more stable improvement at larger reward scales; (3) combining group-level mean with batch-level std enables robust normalization; (4) token-level aggregation is effective on base models but shows limited improvement on already-aligned models; (5) overlong filtering helps short-to-medium reasoning but not long-tail reasoning.
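Finding (4) concerns how per-token losses are averaged. The sketch below contrasts the two aggregation schemes on made-up numbers; the function names and the loss values are hypothetical and only illustrate how response length changes the weighting.

```python
def sequence_level_loss(token_losses):
    """Average within each response first, then across responses.

    Every response gets equal weight, so a token in a long response
    counts for less than a token in a short one.
    """
    per_response = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_response) / len(per_response)


def token_level_loss(token_losses):
    """Average over every token in the batch, so each token has equal weight."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


# Made-up per-token losses: one 2-token response and one 8-token response.
batch = [[1.0, 1.0], [0.5] * 8]
seq_loss = sequence_level_loss(batch)
tok_loss = token_level_loss(batch)
```

Under sequence-level averaging the two responses contribute equally; under token-level averaging the longer response dominates because it holds eight of the ten tokens, which is why the two values differ on the same batch.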
This strongly reinforces the existing insight from the note "Does the choice of RL algorithm actually matter for reasoning?". If vanilla PPO with two techniques matches or surpasses GRPO and DAPO, then the algorithmic innovation in the current RL-for-reasoning literature is largely engineering optimization, not fundamental capability improvement. The pretrained prior determines what's achievable; the algorithm determines how efficiently you get there, with diminishing returns from complexity.
GPPO complicates the minimalist story for one reason: gradient discarding. Klear-Reasoner (2508.07629) identifies a failure mode of standard clipping that the two-technique minimalist set does not address: clipping high-entropy tokens suppresses critical exploration signals at decision points, and clipped suboptimal trajectories lose their gradient contribution entirely, slowing convergence. Gradient-Preserving Policy Optimization (GPPO) keeps clipped tokens in the backpropagation graph with bounded, mild gradients, preserving exploration signal that Clip-Higher still suppresses. This is the one dimension where the minimalist combination genuinely undertrains, and the fix is structural (gradient flow, not advantage shape). The minimalist thesis remains correct for the advantage estimation axis; GPPO is an orthogonal fix on the gradient flow axis.
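The difference between discarding and preserving a clipped token's gradient can be illustrated by looking at the coefficient that multiplies the gradient of the token's log-probability. This is a sketch of the idea only: the function names are hypothetical, and capping the preserved coefficient at the clip boundary is an assumption about what "bounded, mild gradients" means, so the exact GPPO formulation in Klear-Reasoner may differ.

```python
def ppo_clip_grad_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad(log pi) for one token under the standard clipped objective.

    The objective min(ratio * A, clip(ratio) * A) is flat once the clip is
    active, so a clipped token contributes no gradient at all.
    """
    clipped = (advantage > 0 and ratio > 1 + eps_high) or (
        advantage < 0 and ratio < 1 - eps_low
    )
    return 0.0 if clipped else ratio * advantage


def gradient_preserving_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Same objective value, but clipped tokens keep a bounded gradient.

    Assumption: the preserved coefficient is capped at the clip boundary.
    """
    if advantage > 0 and ratio > 1 + eps_high:
        return (1 + eps_high) * advantage
    if advantage < 0 and ratio < 1 - eps_low:
        return (1 - eps_low) * advantage
    return ratio * advantage


# A token whose probability rose well past the upper clip bound.
standard = ppo_clip_grad_weight(ratio=1.5, advantage=1.0)
preserved = gradient_preserving_weight(ratio=1.5, advantage=1.0)
```

Inside the clip range the two functions agree; they differ only for tokens that standard clipping would silence, where the second returns a bounded, nonzero coefficient instead of zero.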
Inquiring lines that use this note as a source 10
- Why does DPO outperform SFT specifically for function calling tasks?
- How does modified PPO handle samples from much older model versions?
- What role does KL penalty strength play in format selection?
- Does DPO improve or harm LLM behavior in different training contexts?
- Can algorithm choice like PPO substitute for recipe-level design decisions?
- How does KL penalty strength affect the degree of format collapse during RL?
- Why does GRPO outperform PPO for stable empathy training?
- Why does vanilla GRPO cause mode collapse in hybrid reasoning settings?
- Can PPO match GRPO and DAPO with just two techniques?
- How does DVAO balance reward components differently than VPO spreads them?
Related concepts in this collection 5
- Does the choice of RL algorithm actually matter for reasoning?
Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
directly supports: even simpler than expected — two techniques suffice
- Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
connects: advantage normalization and token-level loss may work precisely because they manage entropy dynamics
- Does RL training collapse format diversity in pretrained models?
Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
extends: the format convergence may be inevitable regardless of algorithm, which is why algorithm choice doesn't matter much
- Can RL training run while generation continues without waiting?
Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
complementary PPO simplification: AReaL modifies PPO for staleness tolerance in asynchronous training; this note shows PPO needs only two techniques for reasoning performance — together they suggest the PPO framework is more robust and adaptable than the proliferation of replacement algorithms implies
- How should multiple reward objectives be weighted during training?
When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
extends: DVAO adds a variance-adaptive weighting layer atop advantage-normalization machinery
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Bridging Offline and Online Reinforcement Learning for LLMs
- LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
- Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
- RLHF Workflow: From Reward Modeling to Online RLHF
Original note title
two techniques unlock critic-free ppo matching grpo and dapo — advantage normalization and token-level loss aggregation