Does RL training follow predictable scaling curves?
Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.
The first large-scale systematic study of RL scaling for LLMs (400K+ GPU-hours, 200+ models) establishes that RL training follows sigmoidal compute-performance curves. This is the RL equivalent of Chinchilla-style scaling laws for pretraining: fit the curve to enough points from the early, cheaper part of a run, and you can predict where the run will plateau before spending the full compute budget.
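To make the curve shape concrete, here is a minimal Python sketch of one sigmoid parameterization in compute. The functional form and the names `r0`, `a`, `b`, and `c_mid` are assumptions for illustration, since this note does not state the exact formula, and the parameter values are invented.

```python
def sigmoid_reward(compute, r0, a, b, c_mid):
    """Performance after `compute` GPU-hours of RL under an assumed sigmoid.

    r0    : performance before RL (the limit as compute goes to 0)
    a     : asymptotic performance (the ceiling)
    b     : exponent controlling how sharply the curve rises
    c_mid : compute at which half of the total gain (a - r0) is reached
    """
    if compute <= 0:
        return r0
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the shape of the curve.
    for c in (100, 1_000, 10_000, 100_000):
        print(c, round(sigmoid_reward(c, r0=0.30, a=0.60, b=1.0, c_mid=1_000), 3))
```

In this form the curve starts at `r0`, passes the halfway point of the gain at `c_mid`, and flattens toward `a`, which is the quantity a plateau forecast is trying to estimate.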
The key finding sorts RL design choices into two tiers:
- Asymptote-setting choices: these determine the performance ceiling. Not all RL recipes converge to the same asymptotic performance; different combinations of reward design, data composition, and training structure set different ceilings. Small-scale experiments that use the wrong recipe will predict the wrong ceiling.
- Efficiency-modulating details: loss aggregation method, normalization scheme, curriculum design, and off-policy algorithm primarily affect how quickly the model reaches its asymptote, not where that asymptote sits. These are "how fast" knobs, not "how good" knobs.
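The two tiers can be sketched with three hypothetical recipes under an assumed sigmoid form: one baseline, one that differs only in an efficiency parameter, and one that differs only in its asymptote. All numbers are invented for illustration and are not taken from ScaleRL.

```python
def sigmoid_reward(compute, r0, a, b, c_mid):
    """Assumed sigmoid in compute: r0 at the start, a at the ceiling."""
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)


# Hypothetical recipes (illustrative values only).
baseline = dict(r0=0.30, a=0.60, b=1.0, c_mid=2_000)
faster = dict(r0=0.30, a=0.60, b=1.0, c_mid=500)        # efficiency knob only
higher_cap = dict(r0=0.30, a=0.70, b=1.0, c_mid=2_000)  # asymptote knob only


def gap(recipe_x, recipe_y, compute):
    """Performance of recipe_x minus recipe_y at a given compute budget."""
    return sigmoid_reward(compute, **recipe_x) - sigmoid_reward(compute, **recipe_y)


small, large = 1_000, 10_000_000

# An efficiency gain is visible early but vanishes at scale.
early_speed_gap = gap(faster, baseline, small)
late_speed_gap = gap(faster, baseline, large)

# An asymptote gain persists at scale.
late_ceiling_gap = gap(higher_cap, baseline, large)

# At small compute the faster recipe can outrank the higher-ceiling one.
early_rank_gap = gap(faster, higher_cap, small)

if __name__ == "__main__":
    print(early_speed_gap, late_speed_gap, late_ceiling_gap, early_rank_gap)
```

With these invented numbers the faster recipe leads at the small budget even though the higher-ceiling recipe wins at scale, which is why comparing raw small-scale scores, rather than fitted asymptotes, can rank recipes incorrectly.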
The practical value: stable, scalable recipes follow predictable trajectories that enable reliable extrapolation from smaller runs. This means researchers can evaluate whether a recipe is promising by running small-scale experiments and fitting the sigmoid, rather than committing to full-scale training. The ScaleRL "best-practice recipe" was validated by successfully predicting performance on a single 100K GPU-hour run.
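A minimal sketch of the fit-then-extrapolate workflow, assuming the same sigmoid form as a stand-in for the study's curve. The data points are synthetic and noise-free, generated from hypothetical parameters, so this only checks the procedure; `fit_sigmoid` is an illustrative grid search, not the study's fitting method.

```python
def sigmoid_reward(compute, r0, a, b, c_mid):
    """Assumed sigmoid in compute: r0 at the start, a at the ceiling."""
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)


def fit_sigmoid(points, r0, a_grid, b_grid, c_mid_grid):
    """Least-squares grid search over (a, b, c_mid) with r0 held fixed."""
    best = None
    for a in a_grid:
        for b in b_grid:
            for c_mid in c_mid_grid:
                err = sum((sigmoid_reward(c, r0, a, b, c_mid) - r) ** 2
                          for c, r in points)
                if best is None or err < best[0]:
                    best = (err, a, b, c_mid)
    return best[1], best[2], best[3]


# Synthetic "small runs" from hypothetical true parameters (no noise).
TRUE = dict(r0=0.30, a=0.62, b=0.9, c_mid=3_000)
small_runs = [(c, sigmoid_reward(c, **TRUE))
              for c in (250, 500, 1_000, 2_000, 4_000, 8_000)]

a_grid = [round(0.40 + 0.01 * k, 2) for k in range(41)]  # 0.40 .. 0.80
b_grid = [round(0.5 + 0.1 * k, 1) for k in range(11)]    # 0.5 .. 1.5
c_mid_grid = [500 * k for k in range(1, 41)]             # 500 .. 20000

a_hat, b_hat, c_mid_hat = fit_sigmoid(small_runs, TRUE["r0"],
                                      a_grid, b_grid, c_mid_grid)

# Extrapolate well beyond the largest observed budget (8,000 GPU-hours here).
predicted = sigmoid_reward(100_000, TRUE["r0"], a_hat, b_hat, c_mid_hat)
actual = sigmoid_reward(100_000, **TRUE)

if __name__ == "__main__":
    print(a_hat, b_hat, c_mid_hat, predicted, actual)
```

On real, noisy measurements a nonlinear least-squares routine such as SciPy's `scipy.optimize.curve_fit` would likely replace the grid search, and the extrapolation is only as trustworthy as the assumption that the recipe stays stable at larger scale.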
This refines "Does the choice of RL algorithm actually matter for reasoning?": at the algorithm level (PPO vs Expert Iteration vs RC-RL), the options are interchangeable. But at the recipe level (which includes data, reward structure, and training configuration), the choice matters for the asymptote. The algorithm-interchangeability finding operates within a recipe; recipe selection sets the ceiling that all algorithms within it approach.
The sigmoid framework also provides the mathematical structure for "Does policy entropy collapse limit reasoning performance in RL?": entropy collapse IS the approach to sigmoid saturation. The sigmoid curve predicts when collapse will occur, making the previously unpredictable bottleneck forecastable.
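If saturation is read as "a given fraction of the asymptotic gain has been reached", the forecast reduces to inverting the curve. The sketch below does that for an assumed sigmoid form; it says nothing about measured entropy, and identifying saturation with entropy collapse is this note's interpretation rather than something the code checks.

```python
def compute_to_reach(fraction, b, c_mid):
    """Compute at which `fraction` of the gain (a - r0) has been realized.

    Solves  fraction = 1 / (1 + (c_mid / C) ** b)  for C, which gives
    C = c_mid * (fraction / (1 - fraction)) ** (1 / b).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return c_mid * (fraction / (1.0 - fraction)) ** (1.0 / b)


if __name__ == "__main__":
    # Hypothetical fitted values: with b = 1, 90% of the gain takes about 9x c_mid.
    print(compute_to_reach(0.9, b=1.0, c_mid=1_000))
```

A smaller exponent `b` stretches the approach to the ceiling over many more GPU-hours, so two recipes with the same midpoint can saturate at very different budgets.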
Inquiring lines that use this note as a source 8
- How does baseline capability level affect RL improvement ceiling?
- Can scaling predictions become reliable if improvements are continuous not sudden?
- What limits RL's ability to scale for reasoning at training time?
- Which recipe choices determine the asymptotic ceiling in RL training?
- How do RL training and base models differ in creating MI peaks?
- What scaling properties emerge from RL training dynamics beyond verification?
- What training duration is actually needed for RL to expand capabilities?
- How does pretraining determine what RL can later teach a model?
Related concepts in this collection 4
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
  refines: algorithm interchangeability holds within a recipe, but recipe-level choices set different asymptotic ceilings
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  the sigmoid saturation IS entropy collapse approaching the asymptote; ScaleRL provides predictive framework for when it occurs
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  ScaleRL's sigmoid may aggregate over these phases; the two-phase dynamic could explain the inflection point of the sigmoid
- Can reinforcement learning discover reasoning strategies base models cannot?
  Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
  recipes that set higher asymptotes may enable access to novel strategies that lower-asymptote recipes cannot reach
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Art of Scaling Reinforcement Learning Compute for LLMs
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Statistical and Algorithmic Foundations of Reinforcement Learning
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- rStar2-Agent: Agentic Reasoning Technical Report
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Original note title
rl training scaling follows predictable sigmoid trajectories — recipe asymptotes differ while implementation details only modulate efficiency