What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Synthesis note · 2026-02-23 · sourced from MechInterp

Grokking — the phenomenon where models trained far beyond overfitting suddenly generalize — appears discontinuous from the outside. Mechanistic analysis reveals three continuous phases underneath:

Memorization phase. The model learns to reproduce training data through lookup-table-like mechanisms. Training loss drops, test loss remains high. The memorizing circuit dominates.
Circuit formation phase. A generalizing circuit gradually forms in the weights, competing with the memorizing circuit. For modular addition, this circuit uses discrete Fourier transforms and trigonometric identities to convert addition into rotation around a circle. The generalizing circuit is more efficient (its structure is the kind that regularization such as weight decay favors) but initially weaker.
Cleanup phase. The generalizing circuit overtakes the memorizing circuit. Memorization components are pruned away. Test loss drops. Generalization emerges.
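
The "addition as rotation" idea in the circuit formation phase can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the learned weights of any model: the function name modular_add_via_rotation and the frequency set (1, 5, 17) are hypothetical, and the modulus is assumed prime so that every nonzero frequency identifies the answer uniquely.

```python
import numpy as np

def modular_add_via_rotation(a, b, p, freqs):
    """Compute (a + b) mod p using circle embeddings and trig identities.

    Assumes p is prime (or each frequency is coprime to p). The frequency
    set is an arbitrary illustrative choice, not taken from a trained model.
    """
    candidates = np.arange(p)
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # Embed each input as a point on the unit circle.
        cos_a, sin_a = np.cos(w * a), np.sin(w * a)
        cos_b, sin_b = np.cos(w * b), np.sin(w * b)
        # Angle-addition identities compose the two rotations
        # without ever adding a and b as integers.
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        # cos(w * (a + b - c)) equals 1 only when c == (a + b) mod p.
        logits += cos_ab * np.cos(w * candidates) + sin_ab * np.sin(w * candidates)
    return int(np.argmax(logits))

if __name__ == "__main__":
    print(modular_add_via_rotation(100, 50, 113, freqs=(1, 5, 17)))
```

Summing over several frequencies widens the margin between the correct candidate and its neighbours, which is one plausible reason a network would spread the computation over more than one frequency.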

Progress measures defined through mechanistic analysis (tracking the formation of specific algorithmic components) allow monitoring grokking as it happens, replacing the seemingly sudden shift with continuous, predictable development.
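
As a rough illustration of what such a progress measure could look like, the sketch below scores how concentrated an embedding matrix is on a few Fourier frequencies. It is a simplified stand-in, not the measure used in the source paper: the function names, the choice of top_k, and the assumption of an odd (for example prime) number of tokens p are all hypothetical choices made for this example.

```python
import numpy as np

def fourier_concentration(embedding, top_k=5):
    """Fraction of non-constant spectral power held by the top_k frequencies.

    embedding: array of shape (p, d), one row per token 0..p-1, with p odd.
    Values near top_k / (p // 2) suggest no Fourier structure; values near 1
    suggest the embedding lives on a handful of frequencies.
    """
    spectrum = np.fft.rfft(embedding, axis=0)        # shape (p // 2 + 1, d)
    power = (np.abs(spectrum) ** 2).sum(axis=1)[1:]  # drop the constant term
    top = np.sort(power)[::-1][:top_k]
    return float(top.sum() / power.sum())

def sparse_fourier_embedding(p, freqs):
    """Toy embedding built only from cos/sin waves at the given frequencies."""
    t = np.arange(p)
    cols = []
    for k in freqs:
        cols.append(np.cos(2 * np.pi * k * t / p))
        cols.append(np.sin(2 * np.pi * k * t / p))
    return np.stack(cols, axis=1)                    # shape (p, 2 * len(freqs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    random_emb = rng.normal(size=(113, 4))
    sparse_emb = sparse_fourier_embedding(113, freqs=(3, 7))
    print(fourier_concentration(random_emb, top_k=2))
    print(fourier_concentration(sparse_emb, top_k=2))
```

Logged once per checkpoint, a score like this would be expected to rise gradually during circuit formation, well before test loss moves, which is the sense in which the shift becomes continuous and monitorable.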

Two findings on implicit reasoning from the grokked transformers paper:

Composition reasoning (combining facts transitively: A→B and B→C implies A→C) generalizes in-distribution but fails out-of-distribution.
Comparison reasoning (comparing attributes: is A greater than B?) generalizes both in-distribution and out-of-distribution.

The difference correlates with the circuit configuration — comparison allows more systematic generalization because the comparison operation is simpler to represent compactly. The paper recommends cross-layer knowledge sharing mechanisms (memory augmentation, explicit recurrence) to further unlock transformer generalization.

Formal capacity trigger: The memorization capacity paper (2505.24832) adds a quantitative dimension: GPT-family models have an approximate capacity of 3.6 bits per parameter for unintended memorization. Models memorize until this capacity fills, at which point grokking begins and unintended memorization decreases as generalization takes over. On this account the three phases are triggered not by training duration per se but by a measurable capacity-saturation event. The paper also formally separates memorization into unintended memorization (information about a specific dataset) and generalization (information about the true data-generation process). It argues that extraction/generation is neither necessary nor sufficient proof of memorization: a model may memorize patterns without reproducing them verbatim, and a model can be steered into emitting a string it never stored.
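
The capacity figure lends itself to back-of-the-envelope arithmetic. The sketch below assumes the 3.6 bits-per-parameter estimate quoted above and treats the dataset's information content in bits as a given input; how to measure that quantity is outside the scope of this note, and the function names are hypothetical.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the note, not an exact constant

def capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated unintended-memorization capacity of a model, in bits."""
    return n_params * bits_per_param

def saturation_ratio(n_params, dataset_bits, bits_per_param=BITS_PER_PARAM):
    """Dataset information divided by estimated capacity.

    A ratio above 1 means the data exceeds what the model could memorize
    outright, the regime where the note expects generalization to take over.
    dataset_bits is an assumed input; estimating it is not shown here.
    """
    return dataset_bits / capacity_bits(n_params, bits_per_param)

if __name__ == "__main__":
    n_params = 1_000_000
    print(capacity_bits(n_params) / 8)           # capacity in bytes
    print(saturation_ratio(n_params, 7.2e6))     # hypothetical 7.2 Mbit dataset
```

Under these assumptions a one-million-parameter model would hold roughly 3.6 million bits, so any dataset carrying more information than that cannot be stored by memorization alone.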

This connects to the note "How do transformers learn to reason across multiple steps?": both describe staged development of reasoning capability, but grokking requires training far beyond the typical schedule. The practical tension: standard training may terminate before the cleanup phase, leaving models still dominated by the memorizing circuit, where they appear to have learned but have not generalized.

Related concepts in this collection 4

How do transformers learn to reason across multiple steps? Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?
parallel staged development: memorization → in-distribution → cross-distribution; grokking adds the requirement for extended training
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
grokking suggests capabilities are present but buried under memorization components that must be cleaned up
Does RL training follow a predictable two-phase learning sequence? This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
analogous phased development in RL training
When do language models stop memorizing and start generalizing? Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
adds the quantitative trigger: grokking begins when 3.6 bits-per-parameter memorization capacity saturates, not at an arbitrary training step
