Are LLM emergent abilities real or measurement artifacts?
Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.
The sharp, unpredictable transitions that define "emergent abilities" — capabilities appearing suddenly at certain model scales — are artifacts of the researcher's choice of metric rather than fundamental changes in model behavior.
The argument: nonlinear or discontinuous metrics (like exact string match) produce apparent emergent abilities, while linear or continuous metrics (like token edit distance) applied to the same model outputs show smooth, continuous, predictable changes with scale. The "emergence" lives in the measurement, not the model.
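A minimal Python sketch can make this mechanism concrete. It rests on illustrative assumptions that are not stated in this note: per-token cross-entropy falls as a power law in parameter count, token errors are independent, and the constants `c`, `alpha`, and the 5-token answer length are arbitrary. Expected token errors stands in here as a simple proxy for token edit distance.

```python
import math

def per_token_accuracy(n_params, c=1e9, alpha=-0.3):
    """Hypothetical smooth scaling curve (assumed, not measured).

    Cross-entropy is modeled as (n_params / c) ** alpha, and the
    probability of getting one token right as exp(-cross_entropy).
    """
    cross_entropy = (n_params / c) ** alpha
    return math.exp(-cross_entropy)

def exact_match(p, length):
    """Nonlinear metric: all `length` tokens must be correct.

    Assumes token errors are independent, so the probability is p ** length.
    """
    return p ** length

def expected_token_errors(p, length):
    """Linear metric: expected number of wrong tokens out of `length`."""
    return length * (1.0 - p)

def sweep(scales, length=5):
    """Score the same hypothetical models under both metrics."""
    rows = []
    for n in scales:
        p = per_token_accuracy(n)
        rows.append((n, p, exact_match(p, length), expected_token_errors(p, length)))
    return rows

if __name__ == "__main__":
    for n, p, em, err in sweep([1e7, 1e8, 1e9, 1e10, 1e11]):
        print(f"{n:.0e}  p={p:.3f}  exact_match={em:.5f}  expected_errors={err:.2f}")
```

Under these assumptions per-token accuracy climbs steadily across the sweep, while exact match stays near zero at the smaller scales and then rises quickly, even though both numbers are computed from the same underlying quantity.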
Three complementary validations:
- InstructGPT/GPT-3 family: tasks with claimed emergent abilities show smooth improvement under continuous metrics
- BIG-Bench meta-analysis: claimed emergent abilities disappear when the same tasks are scored with different metrics or analyzed with better statistics
- Vision tasks: the same metric manipulation produces never-before-seen "emergent abilities" across diverse deep networks, confirming that the mechanism is metric-dependent, not domain-specific
This doesn't mean models don't improve with scale — they do, continuously. What it challenges is the narrative of sudden capability transitions that implies qualitative changes in what models can do. The practical implication: scaling predictions become much more tractable if improvements are smooth rather than discontinuous.
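The tractability point can be sketched as follows. This is a hedged illustration, not the method of any cited work: the loss values are synthetic and noise-free, generated from an assumed power law, and the helper names `fit_line` and `predict_exact_match` are hypothetical. The idea is to fit the smooth quantity (log loss against log parameters) on small models, extrapolate it, and only then convert to the discontinuous metric.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_exact_match(small_scales, small_losses, target_scale, length):
    """Extrapolate per-token loss in log-log space, then derive exact match.

    Assumes a power-law loss curve and independent token errors.
    """
    log_n = [math.log(n) for n in small_scales]
    log_loss = [math.log(loss) for loss in small_losses]
    slope, intercept = fit_line(log_n, log_loss)
    predicted_loss = math.exp(intercept + slope * math.log(target_scale))
    per_token_accuracy = math.exp(-predicted_loss)
    return per_token_accuracy ** length

if __name__ == "__main__":
    # Synthetic losses for three small models (assumed power law, no noise).
    scales = [1e7, 1e8, 1e9]
    losses = [(n / 1e9) ** -0.3 for n in scales]
    print(predict_exact_match(scales, losses, target_scale=1e11, length=5))
```

In this toy setup the exact-match scores of the three small models are all close to zero, so extrapolating from them directly would give little signal; the smooth loss curve is what carries the prediction. Real losses are noisy, so an actual fit would be less exact than this.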
This connects to "Do foundation models learn world models or task-specific shortcuts?": both notes challenge the narrative of fundamental capability leaps. Heuristics improve gradually with more data, whereas emergence would require qualitative shifts. The metric-artifact finding supports the heuristics interpretation.
Inquiring lines that use this note as a source 26
- What other latent LLM capabilities remain inactive without explicit activation cuing?
- Why do intermediate LLM layers become more precise in frontier models?
- What measurement artifacts emerge when annotators interpret the same question differently?
- What causes models to develop domain capability cliffs after specialization?
- Do larger models develop more abstract features than smaller ones?
- How much do metric choices inflate claims about model capabilities?
- What capabilities actually require massive scale versus specialized training regimes?
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- What skills can large models identify and organize about their own abilities?
- What language capabilities does fluency on standard benchmarks actually measure?
- What makes some model capabilities reliable while others remain brittle?
- Why do scaling laws show capability saturation at specific thresholds?
- How does the Word Novelty Rate metric measure convention formation?
- Why does the gap between theoretical expressiveness and learned capability matter?
- Can capability boundary collapse be reversed through external data?
- Why do metric choices constrain which model capabilities get developed?
- How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
- What makes well-formatted outputs misleading as evidence of model capability?
- Why do leaderboard metrics fail to capture human flourishing in LLM evaluation?
- What makes a standardized artifact unit measurable across different research domains?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- Why does exemplar performance vary across order complexity diversity and style?
- What emergent abilities appear only in truly unified multimodal systems?
- What capability dimension does a closed-ended exam actually fail to measure?
- What capability boundary exists in LLM prediction of effect sizes?
- Do rare cultural concepts fail predictably as model scale increases?
Related concepts in this collection 3
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  gradual heuristic improvement vs sudden capability emergence; both support the same underlying picture
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  appears to conflict: compositional generalization does emerge at scale, but may do so smoothly rather than suddenly
- How much of LLM few-shot ability comes from training data?
  Do large language models genuinely learn from a few examples, or are they mostly recognizing patterns from their training data? This matters for understanding what LLMs can actually do.
  triple challenge to capabilities narrative: metric artifacts inflate emergence claims + task contamination inflates baselines + prompting techniques don't replicate
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Are Emergent Abilities of Large Language Models a Mirage?
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Progress Measures For Grokking Via Mechanistic Interpretability
- Nested Learning: The Illusion of Deep Learning Architectures
- Interactive Evaluation Requires a Design Science
- LLMs Corrupt Your Documents When You Delegate
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Original note title
emergent abilities of LLMs are metric artifacts not fundamental scaling behavior changes