SYNTHESIS NOTE
Model Architecture and Internals Training, RL, and Test-Time Scaling Reasoning, Retrieval, and Evaluation

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

Synthesis note · 2026-02-23 · sourced from Flaws
How do LLMs fail to know what they seem to understand?

The sharp, unpredictable transitions that define "emergent abilities" — capabilities appearing suddenly at certain model scales — are artifacts of the researcher's choice of metric rather than fundamental changes in model behavior.

The argument: nonlinear or discontinuous metrics (like exact string match) produce apparent emergent abilities, while linear or continuous metrics (like token edit distance) applied to the same model outputs show smooth, continuous, predictable changes with scale. The "emergence" lives in the measurement, not the model.

Three complementary validations:

  1. InstructGPT/GPT-3 family — tasks with claimed emergent abilities show smooth improvement under continuous metrics
  2. BIG-Bench meta-analysis — claimed emergent abilities evaporate with different metrics or better statistics
  3. Vision tasks — the same metric manipulation produces never-before-seen "emergent abilities" across diverse deep networks, confirming the mechanism is metric-dependent not domain-specific

This doesn't mean models don't improve with scale — they do, continuously. What it challenges is the narrative of sudden capability transitions that implies qualitative changes in what models can do. The practical implication: scaling predictions become much more tractable if improvements are smooth rather than discontinuous.

This connects to Do foundation models learn world models or task-specific shortcuts? — both challenge the narrative of fundamental capability leaps. Heuristics improve gradually with more data; emergence would require qualitative shifts. The metric artifact finding supports the heuristics interpretation.

Inquiring lines that use this note as a source 26

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
14 direct connections · 160 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

emergent abilities of LLMs are metric artifacts not fundamental scaling behavior changes