Why do power-law distributions make standard ML infrastructure assumptions fail?
This reads the question as asking why long-tailed, frequency-skewed data breaks two load-bearing assumptions of ML infrastructure — that scale improves things uniformly, and that performance is roughly even across the tasks you throw at a model.
Power-law structure, where a few cases are common and a vast tail is rare, clashes with how we build and measure ML systems. The corpus rarely uses the phrase "power-law", but it keeps circling the same underlying fact from different angles: competence tracks frequency. The clearest statement is the "embers of autoregression" work (Can we predict where language models will fail?), which frames LLMs as autoregressive probability machines and predicts that low-probability targets are systematically harder, even when the task is logically trivial, like reciting the alphabet backwards or counting letters. In a power-law world, most of the distinct cases you can encounter, and often a substantial share of all events, live in the low-probability tail. So a system whose competence is shaped by output probability is, by construction, weakest exactly where the distribution puts most of its variety.
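To make the head-versus-tail picture concrete, here is a minimal sketch assuming an idealized Zipf distribution over case types. The number of types, the exponent, and the head cutoff are illustrative assumptions, not figures from the corpus.

```python
def zipf_probs(n_types: int, exponent: float = 1.0) -> list[float]:
    """Normalized Zipf probabilities, p(rank) proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def head_tail_mass(probs: list[float], head_size: int) -> tuple[float, float]:
    """Probability mass of the `head_size` most frequent types vs. all the rest."""
    head = sum(probs[:head_size])
    return head, 1.0 - head


# Illustrative assumptions: 100,000 distinct case types, exponent 1.0,
# and a "head" defined as the 1,000 most frequent types (1% of all types).
probs = zipf_probs(n_types=100_000, exponent=1.0)
head_mass, tail_mass = head_tail_mass(probs, head_size=1_000)

print(f"head: 1% of types, {head_mass:.1%} of events")
print(f"tail: 99% of types, {tail_mass:.1%} of events")
```

Under these assumptions the top 1% of types account for roughly 62% of events, while the remaining 99% of types still account for roughly 38%. The exact split depends heavily on the exponent, so treat the numbers as a shape to reason with rather than a measurement.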
The first infrastructure assumption this breaks is the scaling reflex: add parameters, data, or compute and quality rises uniformly. The corpus repeatedly shows scale failing to rescue the tail. LLMs plateau around 55–60% on genuine constraint satisfaction regardless of architecture or parameter count (Do larger language models solve constrained optimization better?): a ceiling, not a gap a bigger model closes. Even Kaplan-style scaling laws get contradicted at small scale, where deep-and-thin beats wide for the same budget (Does depth matter more than width for tiny language models?). And inference compute trades off against parameter scaling rather than stacking cleanly on top of it (Can inference compute replace scaling up model size?). The shared lesson: "more of the same" mostly buys you the head of the distribution you were already good at.
The second assumption it breaks is that evaluation metrics tell you how a system will behave. Power-law tails are nearly invisible to standard benchmarks because benchmarks sample the head. One paper shows models with perfect linear decodability hiding fractured internal representations that shatter under distribution shift, a failure that standard metrics simply can't see (Can models be smart without organized internal structure?). Another shows reasoning collapsing the moment you decouple a task from its familiar training-distribution semantics (Do large language models reason symbolically or semantically?). Your aggregate score looks fine precisely because the rare cases that will break you are rare in your test set too.
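One way to see how an aggregate score can hide the tail is to slice accuracy by frequency bucket. The sketch below is a toy model, not a result from any cited paper: it assumes Zipf-distributed case types and a hypothetical accuracy curve (`toy_accuracy`) that declines with log rank, then compares the frequency-weighted aggregate with the head and tail slices.

```python
import math


def zipf_probs(n_types: int, exponent: float) -> list[float]:
    """Normalized Zipf probabilities, p(rank) proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def toy_accuracy(rank: int, n_types: int, best: float = 0.98, worst: float = 0.20) -> float:
    """Hypothetical curve: `best` at rank 1, falling linearly in log(rank) to `worst`."""
    return best - (best - worst) * math.log(rank) / math.log(n_types)


def weighted_accuracy(freqs: list[float], accs: list[float]) -> float:
    """Accuracy over events drawn in proportion to `freqs` (need not sum to 1)."""
    return sum(f * a for f, a in zip(freqs, accs)) / sum(freqs)


# Illustrative assumptions: 10,000 case types, exponent 1.2, head = top 100 types.
N_TYPES, HEAD_SIZE = 10_000, 100
probs = zipf_probs(N_TYPES, exponent=1.2)
accs = [toy_accuracy(rank, N_TYPES) for rank in range(1, N_TYPES + 1)]

aggregate = weighted_accuracy(probs, accs)                        # what a head-sampled benchmark reports
head_acc = weighted_accuracy(probs[:HEAD_SIZE], accs[:HEAD_SIZE])  # events from frequent types
tail_acc = weighted_accuracy(probs[HEAD_SIZE:], accs[HEAD_SIZE:])  # events from rare types

print(f"aggregate={aggregate:.3f}  head={head_acc:.3f}  tail={tail_acc:.3f}")
```

In this setup the aggregate always sits between the head and tail slices, and closer to the head the more mass the head carries. Reporting per-bucket scores alongside the aggregate is one way to make the gap visible; the bucket boundary here is an arbitrary choice.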
Here's the part worth carrying away: these tail failures aren't random noise you can average out; they're predictable from where a case sits in the distribution. Linguistic errors worsen smoothly and forecastably as structural complexity rises (Why do large language models fail at complex linguistic tasks?), and where a model will fail can itself be predicted from a computational-level account of what it was trained to do (Can we predict where language models will fail?). Standard ML infrastructure is built on the comforting assumption that errors are i.i.d. and shrink with scale and data. A power-law distribution violates both halves at once: the tail never thins enough to disappear, and the failures in it are structured rather than stochastic. That's why the usual levers (bigger models, more data, higher benchmark scores) keep aiming at the head while the tail quietly decides whether the system actually works.
Sources (7 notes)
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This leaves them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.