What happens inside models when they suddenly generalize?
Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
Grokking — the phenomenon where models trained far beyond overfitting suddenly generalize — appears discontinuous from the outside. Mechanistic analysis reveals three continuous phases underneath:
Memorization phase. The model learns to reproduce training data through lookup-table-like mechanisms. Training loss drops, test loss remains high. The memorizing circuit dominates.
Circuit formation phase. A generalizing circuit gradually forms in the weights, competing with the memorizing circuit. For modular addition, this circuit uses discrete Fourier transforms and trigonometric identities to convert addition into rotation around a circle. The generalizing circuit is more efficient, in the sense that it has the kind of structure that regularization (such as weight decay) favors, but it is initially weaker than the memorizing circuit.
Cleanup phase. The generalizing circuit overtakes the memorizing circuit. Memorization components are pruned away. Test loss drops. Generalization emerges.
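The "addition as rotation" idea from the circuit formation phase can be made concrete with a minimal sketch. It is not the trained network itself, only a hand-written version of the trigonometric algorithm the note describes: each input is mapped to points on circles at a few frequencies, angle-addition identities combine them, and every candidate answer c is scored by how well it matches a + b. The modulus `p = 113`, the frequency set, and the function names are assumptions chosen for illustration.

```python
import math

def fourier_mod_add_logits(a, b, p, freqs):
    """Score each candidate c with sum_k cos(w_k * (a + b - c)), w_k = 2*pi*k/p."""
    logits = []
    for c in range(p):
        total = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities: build cos/sin of w*(a+b)
            # from the separate cos/sin "embeddings" of a and b.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w*(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        logits.append(total)
    return logits

def predict(a, b, p=113, freqs=(3, 17, 41)):
    """Return the candidate with the highest score (hypothetical p and freqs)."""
    logits = fourier_mod_add_logits(a, b, p, freqs)
    return max(range(p), key=logits.__getitem__)

if __name__ == "__main__":
    print(predict(100, 50))  # compare with (100 + 50) % 113
```

Every cosine term equals 1 exactly when c is congruent to a + b modulo p, so the correct answer receives the largest score. This is why a handful of frequencies is enough, and why such a circuit can be far more compact than a lookup table over all input pairs.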
Progress measures defined through mechanistic analysis (tracking the formation of specific algorithmic components) allow monitoring grokking as it happens, replacing the seemingly sudden shift with continuous, predictable development.
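As a rough illustration of what a continuous progress measure might look like, the sketch below tracks how concentrated a weight vector's power is across Fourier frequencies. It is a simplified proxy, not the measures defined in the paper: the assumption is that a forming Fourier-style circuit shows up as power concentrating in a few frequencies, which a Gini coefficient can summarize as a single number per checkpoint. The function names and the synthetic signals are hypothetical.

```python
import cmath
import math
import random

def fourier_power(signal):
    """Power at each frequency 0..n//2, computed with a naive DFT."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2)
    return power

def gini(values):
    """Gini coefficient: 0 when all values are equal, near 1 when one dominates."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

def fourier_sparsity(signal):
    """Hypothetical progress proxy: sparsity of the signal's Fourier power."""
    return gini(fourier_power(signal))

if __name__ == "__main__":
    p = 31
    rng = random.Random(0)
    # Stand-in for a weight column before any structure has formed.
    unstructured = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Stand-in for a weight column dominated by one frequency.
    structured = [math.cos(2 * math.pi * 4 * t / p) for t in range(p)]
    print(fourier_sparsity(unstructured), fourier_sparsity(structured))
```

Logged at every checkpoint, a measure like this would be expected to rise gradually during circuit formation, well before test loss moves, which is the sense in which the apparent jump becomes a continuous curve.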
Two findings on implicit reasoning from the Grokked Transformers paper:
- Composition reasoning (combining facts transitively: A→B and B→C implies A→C) generalizes in-distribution but fails out-of-distribution
- Comparison reasoning (comparing attributes: is A greater than B?) generalizes both in-distribution and out-of-distribution
The difference correlates with the circuit configuration — comparison allows more systematic generalization because the comparison operation is simpler to represent compactly. The paper recommends cross-layer knowledge sharing mechanisms (memory augmentation, explicit recurrence) to further unlock transformer generalization.
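To make the in-distribution versus out-of-distribution distinction concrete, here is a minimal sketch of how such an evaluation could be set up for composition. It assumes a synthetic knowledge graph of atomic facts `(head, relation) -> tail`; the entity counts, split fractions, and function name are hypothetical and are not taken from the paper. In-distribution test items compose atomic facts that also appear inside training compositions, while out-of-distribution test items compose atomic facts that were only ever seen on their own.

```python
import random

def build_splits(n_entities=30, n_relations=4, ood_fraction=0.2,
                 test_fraction=0.1, seed=0):
    """Build ID/OOD two-hop evaluation sets over a random synthetic graph."""
    rng = random.Random(seed)
    # Atomic facts: (head, relation) -> tail.
    atomic = {(h, r): rng.randrange(n_entities)
              for h in range(n_entities) for r in range(n_relations)}
    keys = sorted(atomic)
    rng.shuffle(keys)
    n_ood = int(len(keys) * ood_fraction)
    ood_atomic = set(keys[:n_ood])   # never used inside a training composition
    id_atomic = set(keys[n_ood:])

    def two_hop(allowed):
        """All A->B, B->C chains whose two hops both lie in `allowed`."""
        facts = []
        for (h, r1) in sorted(allowed):
            bridge = atomic[(h, r1)]
            for r2 in range(n_relations):
                if (bridge, r2) in allowed:
                    facts.append((h, r1, r2, atomic[(bridge, r2)]))
        return facts

    id_inferred = two_hop(id_atomic)
    rng.shuffle(id_inferred)
    n_test = max(1, int(len(id_inferred) * test_fraction))
    return {
        "atomic": atomic,                      # all atomic facts
        "id_atomic": id_atomic,
        "ood_atomic": ood_atomic,
        "train_inferred": id_inferred[n_test:],
        "id_test": id_inferred[:n_test],       # unseen chains, seen atomic facts
        "ood_test": two_hop(ood_atomic),       # chains over held-out atomic facts
    }
```

Under this setup, a model that generalizes only in-distribution would score well on `id_test` and poorly on `ood_test`, which is the pattern the note reports for composition.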
Formal capacity trigger: The memorization capacity paper (2505.24832) adds a crucial quantitative dimension: GPT-family models have an approximate capacity of 3.6 bits-per-parameter for unintended memorization. Models memorize until this capacity fills, at which point grokking begins and unintended memorization decreases as generalization takes over. This means the three phases are not triggered by training duration per se, but by a measurable capacity saturation event. The paper also formally separates memorization into unintended memorization (information about a specific dataset) and generalization (information about the true data-generation process), and argues that extraction/generation is neither necessary nor sufficient proof of memorization — a model may memorize patterns without reproducing them verbatim.
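The capacity claim reduces to simple arithmetic, sketched below. The 3.6 bits-per-parameter figure is the approximate value quoted above. The information content of the dataset in bits is treated as a given input, since estimating it is a separate problem, and the function names and example numbers are hypothetical.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the note for GPT-family models

def capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Approximate unintended-memorization capacity of a model, in bits."""
    return n_params * bits_per_param

def saturation_ratio(dataset_bits, n_params, bits_per_param=BITS_PER_PARAM):
    """Dataset information divided by model capacity (assumed inputs)."""
    return dataset_bits / capacity_bits(n_params, bits_per_param)

def is_saturated(dataset_bits, n_params, bits_per_param=BITS_PER_PARAM):
    """True once the dataset holds more bits than the model can memorize."""
    return saturation_ratio(dataset_bits, n_params, bits_per_param) >= 1.0

if __name__ == "__main__":
    n_params = 1_000_000            # hypothetical model size
    cap = capacity_bits(n_params)   # about 3.6 million bits, roughly 450 kB
    print(f"capacity: {cap:.0f} bits ({cap / 8 / 1000:.0f} kB)")
    for dataset_bits in (1e6, 3.6e6, 2e7):  # hypothetical dataset sizes
        print(dataset_bits, is_saturated(dataset_bits, n_params))
```

On this reading, the quantity to watch is the ratio of dataset information to model capacity rather than the number of training steps: below 1 the model can still absorb the data by memorizing it, and above 1 further loss reduction has to come from generalization.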
This connects to "How do transformers learn to reason across multiple steps?": both describe staged development of reasoning capability, but grokking requires training far beyond the typical schedule. The practical tension is that standard training may terminate before the cleanup phase, leaving models in the memorization or circuit formation phase, where they fit the training data but have not yet generalized.
Inquiring lines that use this note as a source 5
This note is a source for these synthesized inquiries.
- Do grokking phases correspond to transitions between nesting levels?
- How does memorization capacity saturation trigger the grokking transition?
- How do the three grokking phases connect to memorization capacity limits?
- Why does grokking reveal the shift from memorization to genuine understanding?
- What makes naive memory consolidation regress below having no memory at all?
Related concepts in this collection 4
- How do transformers learn to reason across multiple steps?
  Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?
  parallel staged development: memorization → in-distribution → cross-distribution; grokking adds the requirement for extended training
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  grokking suggests capabilities are present but buried under memorization components that must be cleaned up
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  analogous phased development in RL training
- When do language models stop memorizing and start generalizing?
  Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
  adds the quantitative trigger: grokking begins when 3.6 bits-per-parameter memorization capacity saturates, not at an arbitrary training step
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- How do Transformers Learn Implicit Reasoning?
- How new data permeates LLM knowledge and how to dilute it
- Progress Measures For Grokking Via Mechanistic Interpretability
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Nested Learning: The Illusion of Deep Learning Architecture Expanded
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Original note title
grokking reveals three continuous phases of learning — memorization then circuit formation then cleanup