SYNTHESIS NOTE

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.

Synthesis note · 2026-02-23 · sourced from Inference time scaling

The first large-scale systematic study of RL scaling for LLMs (400K+ GPU-hours, 200+ models) finds that RL training follows sigmoidal compute-performance curves. This is the RL analogue of Chinchilla-style scaling laws for pretraining: fitting the curve to the early part of a run predicts where that run will plateau before the full compute budget is spent.
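To make "sigmoidal compute-performance curve" concrete, here is a minimal sketch of one saturating parameterization. The functional form, the function name `sigmoid_reward`, and every parameter value are assumptions chosen for illustration; the note itself does not state the exact equation.

```python
def sigmoid_reward(compute, asymptote, midpoint, exponent, start=0.0):
    """Assumed saturating curve (illustrative, not quoted from the note).

    compute   -- training compute spent so far (e.g. GPU-hours), must be > 0
    asymptote -- performance ceiling the run approaches
    midpoint  -- compute at which half of the total gain has been realized
    exponent  -- steepness of the transition
    start     -- performance before RL training
    """
    gain = 1.0 / (1.0 + (midpoint / compute) ** exponent)
    return start + (asymptote - start) * gain


# Hypothetical parameters: performance climbs from 0.2 toward a ceiling of 0.6.
for gpu_hours in (100, 1_000, 10_000, 100_000):
    value = sigmoid_reward(gpu_hours, asymptote=0.6, midpoint=1_000, exponent=1.0, start=0.2)
    print(gpu_hours, round(value, 3))
```

In this form the ceiling (`asymptote`) and the speed of approach (`midpoint`, `exponent`) are separate parameters, which is what lets a fit distinguish "how good" from "how fast".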

The central finding separates RL design choices into two tiers:

Asymptote-setting choices: these determine the performance ceiling. Not all RL recipes converge to the same asymptotic performance. Different combinations of reward design, data composition, and training structure converge to different ceilings, so a small-scale experiment run with the wrong recipe will predict the wrong ceiling.

Efficiency-modulating details: loss aggregation method, normalization scheme, curriculum design, and off-policy algorithm primarily affect how quickly the model reaches its asymptote, not where that asymptote sits. These are "how fast" knobs, not "how good" knobs.
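The two tiers can be illustrated with the same assumed curve shape. In the sketch below, the three recipes and all of their numbers are hypothetical: `faster` changes only an efficiency parameter, `higher` changes only the ceiling.

```python
def sigmoid_reward(compute, asymptote, midpoint, exponent, start=0.0):
    """Assumed saturating curve (illustrative form, not quoted from the note)."""
    gain = 1.0 / (1.0 + (midpoint / compute) ** exponent)
    return start + (asymptote - start) * gain


# Hypothetical recipes; none of these numbers come from the study.
baseline = dict(asymptote=0.60, midpoint=4_000, exponent=1.0, start=0.2)
faster = dict(baseline, midpoint=1_000)   # efficiency change: same ceiling, reached sooner
higher = dict(baseline, asymptote=0.70)   # asymptote change: higher ceiling, same speed

budgets = (1_000, 10_000, 1_000_000)
for name, params in (("baseline", baseline), ("faster", faster), ("higher", higher)):
    print(name, [round(sigmoid_reward(c, **params), 3) for c in budgets])
```

With these made-up numbers, `faster` is ahead of `higher` at the smallest budget even though `higher` ends up better at the largest one. Comparing early checkpoints alone would therefore rank the recipes the wrong way round, which is one reason to fit the curve before choosing.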

The practical value: stable, scalable recipes follow predictable trajectories that enable reliable extrapolation from smaller runs. This means researchers can evaluate whether a recipe is promising by running small-scale experiments and fitting the sigmoid, rather than committing to full-scale training. The ScaleRL "best-practice recipe" was validated by successfully predicting performance on a single 100K GPU-hour run.
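The fit-then-extrapolate workflow can be sketched as follows. This is a simplified stand-in, not the study's fitting procedure: the curve form is assumed, the "small runs" are synthetic noise-free points generated from that same curve, and the fit uses a coarse grid over `exponent` and `midpoint` with closed-form least squares for `start` and `asymptote`. All names and values are hypothetical.

```python
def sigmoid_reward(compute, asymptote, midpoint, exponent, start):
    """Assumed saturating curve (illustrative form, not quoted from the note)."""
    return start + (asymptote - start) / (1.0 + (midpoint / compute) ** exponent)


def fit_sigmoid(points, exponents, midpoints):
    """Grid search over (exponent, midpoint); least squares for (start, asymptote).

    For fixed exponent and midpoint the model is linear in the other two
    parameters: reward = start * (1 - s) + asymptote * s, where s is the gain.
    """
    best_error, best_params = None, None
    rewards = [r for _, r in points]
    for exponent in exponents:
        for midpoint in midpoints:
            s = [1.0 / (1.0 + (midpoint / c) ** exponent) for c, _ in points]
            u = [1.0 - si for si in s]
            # Normal equations for the two linear parameters.
            suu = sum(x * x for x in u)
            sss = sum(x * x for x in s)
            sus = sum(x * y for x, y in zip(u, s))
            sur = sum(x * r for x, r in zip(u, rewards))
            ssr = sum(x * r for x, r in zip(s, rewards))
            det = suu * sss - sus * sus
            if abs(det) < 1e-12:
                continue  # degenerate grid point, skip it
            start = (sur * sss - ssr * sus) / det
            asymptote = (ssr * suu - sur * sus) / det
            error = sum((start * ui + asymptote * si - r) ** 2
                        for ui, si, r in zip(u, s, rewards))
            if best_error is None or error < best_error:
                best_error = error
                best_params = dict(asymptote=asymptote, midpoint=midpoint,
                                   exponent=exponent, start=start)
    return best_params


# Synthetic sanity check: generate points from known parameters, then recover them.
TRUE_PARAMS = dict(asymptote=0.6, midpoint=2_000.0, exponent=1.0, start=0.2)
small_runs = [(c, sigmoid_reward(c, **TRUE_PARAMS))
              for c in (250, 500, 1_000, 2_000, 4_000, 8_000)]

fitted = fit_sigmoid(
    small_runs,
    exponents=[0.5, 0.75, 1.0, 1.25, 1.5, 2.0],
    midpoints=[250.0 * 2 ** k for k in range(8)],
)
predicted = sigmoid_reward(100_000, **fitted)  # extrapolate well past the fitted range
print(fitted, round(predicted, 4))
```

Real measurements are noisy and would normally call for a continuous optimizer and some check on fit stability; the grid here only keeps the example dependency-free.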

This refines "Does the choice of RL algorithm actually matter for reasoning?": at the algorithm level (PPO vs Expert Iteration vs RC-RL), the choices are interchangeable, but at the recipe level (data, reward structure, and training configuration), the choice determines the asymptote. Algorithm interchangeability holds within a recipe; recipe selection sets the ceiling that all algorithms within it approach.

The sigmoid framework also provides the mathematical structure for "Does policy entropy collapse limit reasoning performance in RL?". On this reading, entropy collapse is the approach to sigmoid saturation, so the fitted curve predicts when collapse will occur and makes a previously unpredictable bottleneck forecastable.
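If saturation is what is being forecast, a fitted curve can be inverted to ask how much compute is needed to close a given fraction of the gap to the asymptote. The sketch below assumes the gain has the form 1 / (1 + (midpoint / compute) ** exponent); that form, the function name, and the numbers are illustrative assumptions, and the code says nothing about entropy itself.

```python
def compute_to_reach(fraction, midpoint, exponent):
    """Compute at which the assumed curve has closed `fraction` of the
    gap between starting performance and the asymptote.

    Derived by solving fraction = 1 / (1 + (midpoint / compute) ** exponent)
    for compute.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return midpoint * (fraction / (1.0 - fraction)) ** (1.0 / exponent)


# Hypothetical fitted values: midpoint of 1,000 GPU-hours, exponent of 1.
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(compute_to_reach(fraction, midpoint=1_000, exponent=1.0)))
```

By construction, half of the gap is closed at the midpoint; how quickly the remaining fractions follow depends on the exponent.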

Related concepts in this collection 4

Does the choice of RL algorithm actually matter for reasoning? Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
refines: algorithm interchangeability holds within a recipe, but recipe-level choices set different asymptotic ceilings
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
the sigmoid saturation is entropy collapse approaching the asymptote; ScaleRL provides a predictive framework for when it occurs
Does RL training follow a predictable two-phase learning sequence? This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
ScaleRL's sigmoid may aggregate over these phases; the two-phase dynamic could explain the inflection point of the sigmoid
Can reinforcement learning discover reasoning strategies base models cannot? Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
recipes that set higher asymptotes may enable access to novel strategies that lower-asymptote recipes cannot reach
