Can scaling predictions become reliable if improvements are continuous rather than sudden?
This explores whether smooth, predictable scaling curves (vs. sudden emergent jumps) are what make it possible to forecast a model's behavior from small runs — and where that reliability breaks down.
The intuition is that if capability climbs smoothly rather than leaping out of nowhere, you can extrapolate from cheap small-scale runs. The corpus broadly supports this, with a sharp caveat: continuity is necessary, not sufficient. The strongest evidence comes from a 400K GPU-hour study showing that RL training follows predictable sigmoid trajectories, where small runs reliably forecast where a recipe will asymptote (Does RL training follow predictable scaling curves?). The key insight there isn't just that the curve is smooth; it's that the *recipe* sets the ceiling while implementation details only move efficiency. Stability of the underlying process, not smoothness alone, is what makes extrapolation trustworthy.
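To make "small runs forecast the asymptote" concrete, here is a minimal sketch. It is not the study's fitting procedure. It assumes a saturating sigmoid in compute, R(C) = floor + (ceiling - floor) / (1 + (midpoint / C)^slope), fits it to a handful of synthetic, noise-free small-run points by coarse grid search, and reads off the fitted ceiling. The functional form, the parameter names, the compute units, and the data are all illustrative assumptions.

```python
import itertools

def sigmoid_curve(compute, floor, ceiling, midpoint, slope):
    """Saturating sigmoid in compute: rises from `floor` toward `ceiling`."""
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** slope)

def fit_ceiling(points, floor):
    """Coarse grid search for (ceiling, midpoint, slope) minimizing squared error."""
    ceilings = [0.40 + 0.01 * i for i in range(56)]   # 0.40 .. 0.95
    midpoints = [250.0 * 2 ** i for i in range(8)]    # hypothetical compute units
    slopes = [0.5 + 0.1 * i for i in range(16)]       # 0.5 .. 2.0
    best = None
    for a, m, b in itertools.product(ceilings, midpoints, slopes):
        err = sum((sigmoid_curve(c, floor, a, m, b) - r) ** 2 for c, r in points)
        if best is None or err < best[0]:
            best = (err, a, m, b)
    return best[1:]

# Synthetic "recipe" (made up): the small runs stop at 3,200 compute units.
TRUE = dict(floor=0.20, ceiling=0.70, midpoint=2000.0, slope=1.0)
small_runs = [(c, sigmoid_curve(c, **TRUE)) for c in (100, 200, 400, 800, 1600, 3200)]

ceiling, midpoint, slope = fit_ceiling(small_runs, floor=TRUE["floor"])
forecast = sigmoid_curve(100_000, TRUE["floor"], ceiling, midpoint, slope)
print(f"fitted ceiling={ceiling:.2f}, forecast at 100k compute={forecast:.3f}")
```

The data here are noise-free and the small runs already reach past the midpoint, so the ceiling is easy to recover. With noisy measurements, or runs that stop before the curve starts to bend, the ceiling estimate becomes much less certain.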
The complication is that 'the model' doesn't have one scaling curve; it has many, and they don't move together. A skill-level decomposition shows reasoning and knowledge skills improving continuously while metacognition saturates at 7B parameters and logical efficiency plateaus at 30B (Do all AI skills improve equally as models scale?). So a prediction that holds beautifully for reasoning can be flatly wrong for a skill that already hit its ceiling. Continuity per skill is real, but the aggregate 'how good is this model' number blends curves with different shapes, which is exactly where naive extrapolation misleads.
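A toy illustration of why the aggregate misleads, using made-up curves rather than FLASK data: one hypothetical skill keeps improving with log parameter count, while another follows the same trend and then saturates at 7B. A straight-line extrapolation from small models is right for the first, overshoots for the second, and the averaged score inherits part of the error. All function names and score scales are assumptions for illustration.

```python
import math

def reasoning(params_b):
    """Hypothetical skill that keeps improving linearly in log10(parameters)."""
    return 2.0 + 1.0 * math.log10(params_b)

def metacognition(params_b):
    """Hypothetical skill with the same early trend that saturates at 7B."""
    return 2.0 + 1.0 * math.log10(min(params_b, 7.0))

def aggregate(params_b):
    """Headline score: plain average of the two skills."""
    return (reasoning(params_b) + metacognition(params_b)) / 2.0

def linear_extrapolate(skill, small_sizes, target):
    """Fit a line through two small-model points in log space and extend it."""
    x0, x1 = (math.log10(s) for s in small_sizes)
    y0, y1 = skill(small_sizes[0]), skill(small_sizes[1])
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (math.log10(target) - x0)

small, target = (1.0, 7.0), 70.0  # model sizes in billions of parameters
for name, skill in [("reasoning", reasoning),
                    ("metacognition", metacognition),
                    ("aggregate", aggregate)]:
    predicted = linear_extrapolate(skill, small, target)
    print(f"{name}: predicted={predicted:.2f} actual={skill(target):.2f}")
```

Both skills look identical at 1B and 7B, so nothing in the small-model data distinguishes the curve that keeps climbing from the one that has just hit its ceiling.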
There's also a deeper trap worth naming: smoothness can create false confidence. Deterministic settings produce perfectly consistent, repeatable outputs that are still just one draw from a probability distribution: consistency is not reliability (Does setting temperature to zero actually make LLM outputs reliable?). The same caution scales up to prediction itself: a clean curve tells you the trend is stable, not that the metric you're plotting means what you think. Binary-reward training, for instance, can keep accuracy climbing predictably while quietly destroying calibration, so the model gets more confidently wrong even as the headline number improves (Does binary reward training hurt model calibration?).
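The calibration point can be checked with a few lines of arithmetic. The sketch below assumes a model that is truly correct with probability p and reports confidence q. It compares the expected payoff of a correctness-only reward with one that also subtracts a Brier penalty, (q - outcome)^2. The function names and the way the two terms are combined are assumptions for illustration, not the cited method's implementation.

```python
def expected_binary_reward(p_correct, confidence):
    """Correctness-only reward: `confidence` is deliberately unused in the payoff."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

def expected_brier_reward(p_correct, confidence):
    """Correctness minus the Brier penalty (confidence - outcome)^2, in expectation."""
    penalty_if_right = (confidence - 1.0) ** 2
    penalty_if_wrong = (confidence - 0.0) ** 2
    expected_penalty = (p_correct * penalty_if_right
                        + (1.0 - p_correct) * penalty_if_wrong)
    return p_correct - expected_penalty

def best_confidence(reward_fn, p_correct, steps=100):
    """Confidence on a 0..1 grid that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: reward_fn(p_correct, q))

p = 0.6  # hypothetical true probability of answering correctly
flat = {expected_binary_reward(p, q / 10) for q in range(11)}
print("binary reward values across confidences:", flat)
print("Brier-augmented optimum:", best_confidence(expected_brier_reward, p))
```

Under the correctness-only reward every stated confidence earns the same expected payoff, so nothing discourages overconfidence. With the Brier term, expected reward peaks when the stated confidence matches the true probability of being right.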
What actually makes predictions more reliable, the corpus suggests, is widening what goes into them rather than hoping for smoothness. Conditional scaling laws that fold in architectural variables (hidden size, MLP-to-attention ratio, GQA configuration) predict inference behavior well enough to optimize for it, yielding up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under the same training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?). And the resource axes aren't independent: inference-time compute trades off against parameter count on hard prompts (Can inference compute replace scaling up model size?), while pretraining and fine-tuning scale along separate channels, one driving factuality and the other helpfulness (Do pretraining and fine-tuning scale independently in language models?). A prediction that varies only one knob can be continuous and still wrong.
The thing you might not have known you wanted to know: the failure case for forecasting isn't the sudden jump; it's the smooth curve that keeps climbing right up to a plateau you didn't see coming. Plateaus themselves can mislead too: natural-language critique can unstick a model from a numerical-reward ceiling that looked like a hard wall (Can natural language feedback overcome numerical reward plateaus?). Continuity makes extrapolation *possible*; knowing your recipe, your per-skill curves, and what each metric actually measures is what makes it *reliable*.
Sources (8 notes)
Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.
FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.