Why should scaling laws be understood as properties of data distribution rather than training in general?
This explores a reframing in the corpus: that the regularities we call 'scaling laws' aren't universal facts about throwing more compute at training — they're facts about what already lives in the data the model was trained on, and how different training stages surface or rearrange that material.
This question reads the corpus as quietly arguing that scaling laws describe a *distribution*, not a *process*. The cleanest evidence is what happens during reinforcement learning. When you RL-train a model, it doesn't invent new behavior: within the first epoch it converges on a single dominant format that was already present in the pretraining distribution, amplifying that format and suppressing the alternatives (Does RL training collapse format diversity in pretrained models?). Strikingly, *which* format wins depends on model scale, not necessarily on which format performs best. So the 'gains' from this training stage are largely a redistribution of probability mass over patterns the data already contained. On this reading, the scaling behavior is a property of what was in the corpus, not of the optimizer.
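The "redistribution of probability mass" idea can be made concrete with a toy sketch. This is not the RL procedure from the cited note: it assumes a simple multiplicative reweighting update over a categorical distribution, and the format names, prior probabilities, and rewards are all made up for illustration. It shows only one narrow point, that an update of this kind can concentrate mass on a format the prior already contains but cannot create mass for a format the prior assigns zero probability, however high its reward. It does not model how scale selects the winner.

```python
import math

def reweight(probs, rewards, eta=1.0):
    """One multiplicative update: p_i <- p_i * exp(eta * r_i), then renormalize."""
    scaled = [p * math.exp(eta * r) for p, r in zip(probs, rewards)]
    total = sum(scaled)
    return [s / total for s in scaled]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def train(prior, rewards, steps=50, eta=1.0):
    """Apply the toy update repeatedly, starting from the prior."""
    probs = list(prior)
    for _ in range(steps):
        probs = reweight(probs, rewards, eta)
    return probs

# Hypothetical formats and numbers, chosen only for illustration.
formats = ["prose", "code", "latex", "unseen"]
prior = [0.5, 0.3, 0.2, 0.0]      # "unseen" never appears in the toy pretraining data
rewards = [1.0, 0.9, 0.8, 5.0]    # "unseen" would score best if it could be sampled

final = train(prior, rewards)

for name, before, after in zip(formats, prior, final):
    print(f"{name:>6}: {before:.3f} -> {after:.3f}")
print(f"entropy: {entropy(prior):.3f} -> {entropy(final):.3f}")
```

In this sketch the highest-reward format never gains any mass because it was absent from the prior, while diversity among the formats that were present collapses toward one of them.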
The same logic appears when you decompose training into stages. Scaling pretraining and scaling fine-tuning improve *different* things: pretraining drives factual knowledge, while fine-tuning drives behavioral helpfulness. This split has a physical home in the network, with lower layers storing knowledge from the broad distribution and upper layers expressing behavior (Do pretraining and fine-tuning scale independently in language models?). If scaling were a generic property of 'more training,' you wouldn't see two cleanly decoupled curves. You see them because each stage is operating on a different slice of distributional information.
What really sharpens the point is that you can move the scaling curve by changing the data alone. Thinking-augmented pretraining rewrites the training corpus with generated reasoning traces, with harder tokens automatically attracting longer traces, and yields 3x data efficiency without touching the architecture or compute schedule (Can training data augmentation match test-time compute scaling benefits?). Conversely, a smaller student model can *beat* its LLM teacher purely by being trained on a broader input distribution, smoothed by teacher labels (Can smaller models outperform their LLM teachers with enough data?). In both cases the lever is the distribution, not the amount of training, which is exactly what you'd expect if the 'law' is a property of the data.
The corpus also shows that the original scaling laws aren't even stable across regimes, which is what you'd predict if they're empirical fits to particular distributions rather than physics. For sub-billion-parameter models (125M–350M in the MobileLLM experiments), depth beats width, which cuts against the Kaplan prescription that treats parameter count as the master variable (Does depth matter more than width for tiny language models?). And once you fold architectural variables (hidden size, MLP-to-attention ratio, GQA) into the scaling law itself, you get gains the original formulation couldn't see: up to 42% greater throughput and up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?). A law that needs to be re-fit per architecture and per scale band is describing a conditional relationship in the data, not a universal property of training.
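To see what "re-fit per regime" means in practice, here is a minimal sketch, assuming the loss follows a pure power law in parameter count, loss = a * N^(-alpha), and fitting it by least squares in log-log space. The two "regimes" are synthetic curves generated from made-up coefficients, not measurements from any of the cited notes. The point is only that a single pooled fit returns an exponent that matches neither regime.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space.

    Returns (a, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, made-up curves standing in for two architectures or scale bands.
sizes = [1e6, 1e7, 1e8, 1e9]
regime_a = [5.0 * n ** -0.08 for n in sizes]
regime_b = [9.0 * n ** -0.12 for n in sizes]

a_coef, a_alpha = fit_power_law(sizes, regime_a)
b_coef, b_alpha = fit_power_law(sizes, regime_b)

# Pooling both regimes into one fit blurs the two exponents together.
pooled_coef, pooled_alpha = fit_power_law(sizes + sizes, regime_a + regime_b)

print(f"regime A alpha: {a_alpha:.3f}")
print(f"regime B alpha: {b_alpha:.3f}")
print(f"pooled alpha:   {pooled_alpha:.3f}")
```

Each per-regime fit recovers its own generating exponent, while the pooled exponent lands between them and describes neither curve, which is the sense in which a single "universal" fit hides a conditional relationship.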
The most provocative thread: this distributional view predicts that scaling 'laws' recur in places that share no training mechanism at all. Deep-research agents searching the web follow the *same* diminishing-returns pattern as reasoning-token scaling, even though one spends inference compute on search steps and the other on generated tokens (Do search steps follow the same scaling rules as reasoning tokens?). And latent-thought models open entirely new scaling dimensions decoupled from parameter count (Can latent thought vectors scale language models beyond parameters?). If the same curve shows up across training, inference, and search, the regularity can't belong to any one training procedure; it belongs to the structure of the information being consumed. The practical takeaway is therefore unexpected: to bend a scaling curve, change what's in the distribution, not how hard you train on it.
Sources (8 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.