Do grokking phases correspond to transitions between nesting levels?
This reads 'grokking phases' as the memorize→generalize→prune stages of delayed generalization, and 'nesting levels' as moving up or down a hierarchy of structure (recursive depth, taxonomic levels) — so the question is whether each phase shift is the model climbing a level of nested structure.
The honest answer from the corpus: grokking's phases are real and measurable, but they aren't described as transitions between nesting levels. They're triggered by capacity, not by climbing a hierarchy. Grokking unfolds in three continuous internal phases (memorization via lookup tables, gradual formation of generalizing circuits, then pruning of the memorized parts), which look like a sudden jump from the outside but are smooth underneath (see "What happens inside models when they suddenly generalize?"). What actually flips the switch is memorization capacity saturating: GPT-family models memorize up to roughly 3.6 bits per parameter, and only once that store fills does the phase transition into generalization begin (see "When do language models stop memorizing and start generalizing?"). So the phase boundary is set by how full the model is, not by which level of a nested structure it has reached.
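To make the capacity argument concrete, here is a minimal arithmetic sketch. It assumes the roughly 3.6 bits-per-parameter figure quoted above and treats saturation as a simple threshold comparison. The function names and the `dataset_bits` input (some estimate of how much information the training set contains) are hypothetical and are not taken from the cited notes.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted above for GPT-family models


def memorization_capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Rough estimate of how many bits a model can memorize."""
    if n_params < 0:
        raise ValueError("n_params must be non-negative")
    return n_params * bits_per_param


def capacity_is_saturated(n_params, dataset_bits, bits_per_param=BITS_PER_PARAM):
    """True when the data holds at least as many bits as the model can store.

    Assumption: saturation is modeled as a hard threshold, which is a
    simplification of the smooth transition described in the text.
    """
    return dataset_bits >= memorization_capacity_bits(n_params, bits_per_param)


# A hypothetical 1M-parameter model: about 3.6 million bits of capacity.
capacity = memorization_capacity_bits(1_000_000)
small_data_saturates = capacity_is_saturated(1_000_000, dataset_bits=2_000_000)
large_data_saturates = capacity_is_saturated(1_000_000, dataset_bits=5_000_000)
print(capacity, small_data_saturates, large_data_saturates)
```

On this reading, the smaller dataset fits inside the store, so memorization alone suffices, while the larger one exceeds it, which is the regime where the notes place the shift toward generalization.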
That said, the corpus does contain something close to the spirit of the question, just in a different place. When you look at how representations organize themselves, learning genuinely does move through nesting levels: the leading eigenvectors of embedding Gram matrices split taxonomy coarse-to-fine, separating broad branches first and progressively finer sub-branches, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). That is a real 'transition between nesting levels,' but it's a property of how structure precipitates in representation space, not a relabeling of grokking's three phases.
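The coarse-to-fine spectral ordering can be illustrated on a toy example. The sketch below uses four hand-built embeddings (an assumption for illustration, not data from the cited note) in which the coarse animal/plant split has larger magnitude than the finer splits. It then extracts the leading eigenvectors of the Gram matrix with plain power iteration and deflation. Because the hierarchy is baked into the toy data, this shows what the ordering looks like, not that real embeddings behave this way.

```python
import math

# Hypothetical embeddings: axis 0 encodes the coarse branch (animal vs. plant),
# axes 1 and 2 encode the finer split inside each branch at smaller magnitude.
EMBEDDINGS = {
    "cat": [2.0, 1.0, 0.0],
    "dog": [2.0, -1.0, 0.0],
    "oak": [-2.0, 0.0, 0.5],
    "pine": [-2.0, 0.0, -0.5],
}


def gram_matrix(vectors):
    """Matrix of pairwise dot products between the embeddings."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def leading_eigenpairs(matrix, k, iters=200):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        # Fixed, non-symmetric start vector so the run is deterministic.
        vec = normalize([1.0 + 0.1 * i for i in range(n)])
        for _ in range(iters):
            vec = normalize(matvec(work, vec))
        value = sum(v * w for v, w in zip(vec, matvec(work, vec)))
        pairs.append((value, vec))
        # Deflate: subtract the component just found before the next pass.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return pairs


names = list(EMBEDDINGS)
pairs = leading_eigenpairs(gram_matrix([EMBEDDINGS[n] for n in names]), k=3)
for value, vec in pairs:
    print(round(value, 3), {n: round(x, 3) for n, x in zip(names, vec)})
```

In this toy setup the first eigenvector separates animals from plants by sign, and the later ones split cat from dog and then oak from pine, with eigenvalues near 16, 2, and 0.5.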
The more suggestive overlap is that staged learning seems to be a recurring shape, independent of the task. Transformers learning multi-hop reasoning pass through three developmental phases of their own (memorization, in-distribution generalization, then cross-distribution reasoning), with successful reasoning marked by entity representations clustering together (see "How do transformers learn to reason across multiple steps?"). SFT-then-RL training shows yet another three-phase arc: capability disruption, readaptation, then overfitting (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"). A staged pattern of this kind keeps reappearing, which hints that grokking's phases may be one instance of a general learning rhythm rather than a hierarchy-climbing process specifically.
Where nesting becomes load-bearing is architecture, not training dynamics. Pushdown Layers add an explicit stack to attention and get 3–5x more sample-efficient syntactic generalization, showing that recursive (genuinely nested) structure benefits from being built in rather than waited for: the model doesn't reliably grok deep nesting on its own (see "Can explicit stack tracking improve how transformers learn recursive syntax?"). So the thing you might hope grokking does, work its way up nesting levels, is exactly the thing the corpus suggests needs an architectural crutch.
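For intuition about what "explicit stack tracking" means for nested structure, the following toy sketch labels each token of a bracketed sequence with its nesting depth using an ordinary stack. It is only an illustration of the bookkeeping involved and is not the Pushdown Layers mechanism itself. The function name and the bracket tokens are assumptions made for the example.

```python
def nesting_depths(tokens, open_tok="(", close_tok=")"):
    """Return the nesting depth of each token in a bracketed sequence.

    Brackets are reported at the depth of the level that contains them;
    other tokens are reported at the depth of their innermost open bracket.
    """
    depths = []
    stack = []  # positions of currently open brackets
    for position, tok in enumerate(tokens):
        if tok == close_tok:
            if not stack:
                raise ValueError(f"unmatched {close_tok!r} at position {position}")
            stack.pop()
            depths.append(len(stack))
        elif tok == open_tok:
            depths.append(len(stack))
            stack.append(position)
        else:
            depths.append(len(stack))
    if stack:
        raise ValueError(f"unclosed {open_tok!r} at position {stack[-1]}")
    return depths


example = "( a ( b ) c )".split()
example_depths = nesting_depths(example)
print(list(zip(example, example_depths)))
```

A depth signal like this is trivial to compute symbolically, which is part of why supplying it as an inductive bias can be cheaper than waiting for a model to infer it from data.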
The takeaway you might not have expected: grokking and nesting-level transitions are two different stories the corpus tells, and conflating them is tempting but unsupported. Grokking is a capacity-driven phase change; coarse-to-fine and recursive structure are about how representations and architectures organize hierarchy. The interesting open territory is whether the capacity threshold that triggers grokking is what frees a model to start resolving finer levels of structure — but the collection documents these as parallel phenomena, not as the same mechanism.
Sources (6 notes)
Models trained past overfitting generalize through three stages: memorization via lookup tables, gradual formation of generalizing circuits, then pruning of memorization components. Mechanistic analysis shows this appears discontinuous externally but progresses continuously, triggered by memorization capacity saturation.
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.
Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.