Why does grokking reveal the shift from memorization to genuine understanding?
This explores grokking as a window into the moment a model stops storing answers and starts computing them — and why that transition is measurable rather than mystical.
The corpus suggests that grokking is so revealing because it makes an internal change *visible from the outside*: a model that has been memorizing for thousands of steps, and looks stuck, suddenly generalizes, and mechanistic analysis shows the suddenness is an illusion. The shift was happening continuously underneath. Grokking unfolds in three measurable phases: first memorization via something like a lookup table, then the gradual formation of generalizing circuits, then the *pruning* of the memorized components once the circuit can carry the load (see "What happens inside models when they suddenly generalize?"). What looks like a flash of understanding is really the slow construction of machinery followed by the demolition of the scaffolding it replaced.
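The externally visible side of this can be sketched in a few lines. The snippet below is a minimal illustration, not a method taken from the cited notes: it assumes you have logged training and held-out accuracy at regular steps, and it measures the gap between the step where the model fits its training data and the step where it generalizes. The function names, the 0.99 threshold, and the log values are all hypothetical. It says nothing about the internal phases, which only mechanistic analysis can expose.

```python
from typing import Optional, Sequence


def first_step_at_or_above(
    steps: Sequence[int], values: Sequence[float], threshold: float
) -> Optional[int]:
    """Return the first logged step whose value reaches the threshold, or None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None


def grokking_delay(
    steps: Sequence[int],
    train_acc: Sequence[float],
    test_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff for "fits" / "generalizes"
) -> Optional[int]:
    """Steps between fitting the training set and generalizing to held-out data."""
    fit_step = first_step_at_or_above(steps, train_acc, threshold)
    gen_step = first_step_at_or_above(steps, test_acc, threshold)
    if fit_step is None or gen_step is None:
        return None  # one of the two events never happened in this log
    return gen_step - fit_step


# Hypothetical log, invented purely for illustration.
steps = [0, 1000, 2000, 3000, 4000, 5000]
train_acc = [0.10, 1.00, 1.00, 1.00, 1.00, 1.00]
test_acc = [0.10, 0.12, 0.15, 0.30, 0.80, 1.00]

delay = grokking_delay(steps, train_acc, test_acc)
print(f"steps between memorizing and generalizing: {delay}")
```

A long delay with flat held-out accuracy is the "looks stuck" period described above. The curve alone cannot distinguish a model that is quietly building circuits from one that never will.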
What turns this from a curiosity into a law is capacity. Models memorize until they run out of room, roughly 3.6 bits per parameter for GPT-family models, and only when that storage fills does the phase transition into generalization trigger (see "When do language models stop memorizing and start generalizing?"). This reframes 'understanding' in an almost economic way: generalization isn't a virtue the model chooses, it's what happens when rote storage stops being affordable. Memorization is the default; genuine structure is the thing the model is forced into when it can no longer cheat by remembering. Grokking reveals the boundary because it shows you exactly where the cheating becomes impossible.
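To make the capacity figure concrete, here is a back-of-the-envelope sketch that uses only the 3.6 bits-per-parameter estimate quoted above. The parameter count and the dataset size are hypothetical. Comparing a dataset's information content directly against capacity is a simplifying assumption, since the text does not say how that content would be measured.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted above for GPT-family models


def memorization_capacity_bits(
    num_params: int, bits_per_param: float = BITS_PER_PARAM
) -> float:
    """Estimated raw storage available for memorization, in bits."""
    return num_params * bits_per_param


def exceeds_capacity(dataset_bits: float, num_params: int) -> bool:
    """True when the data holds more bits than the model could store by rote.

    Assumption: dataset_bits is some estimate of the information the model
    would need to memorize; how to obtain it is outside this sketch.
    """
    return dataset_bits > memorization_capacity_bits(num_params)


num_params = 1_000_000  # hypothetical small model
capacity_bits = memorization_capacity_bits(num_params)
capacity_megabytes = capacity_bits / 8 / 1_000_000

print(f"capacity: {capacity_bits:.0f} bits (about {capacity_megabytes:.2f} MB)")
print(exceeds_capacity(dataset_bits=10_000_000, num_params=num_params))
```

Under these assumptions a one-million-parameter model can store roughly 3.6 million bits by rote, so a training set carrying more than that can no longer be handled by storage alone.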
The corpus has a sharp counterpoint on what *fails* to cross that boundary. Imitation training, that is, copying a stronger model's outputs, produces systems that mimic fluent, confident style without closing any real capability gap, because the ceiling is set by base-model fundamentals, not by the surface you train on (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). That's the inverse of grokking: imitation is memorization dressed up to *look* like understanding, while grokking is understanding that arrived without ever looking like it. Both cases warn against trusting external appearances: one model seems stuck but isn't, the other seems competent but is hollow.
There's a stranger thread worth pulling. Models can be trained on deliberately corrupted reasoning traces and still solve problems as well as, and sometimes better than, models trained on correct ones, which suggests the visible 'reasoning' often functions as computational scaffolding rather than meaningful thought (see "Do reasoning traces need to be semantically correct?"). Read alongside grokking, this is humbling: the surface text of a model's reasoning is not where understanding lives. Grokking locates the real thing inside the weights, in circuits you can only see by looking mechanistically, not in anything the model says about itself.
The thing you may not have known you wanted to know: 'understanding' in these systems has a physical trigger and a measurable address. It isn't a property you coax out with better prompts or prettier traces — it's a phase transition that fires when memory saturates, and grokking is simply the one place where we get to watch it happen.
Sources (4 notes)
What happens inside models when they suddenly generalize? Models trained past overfitting generalize through three stages: memorization via lookup tables, gradual formation of generalizing circuits, then pruning of memorization components. Mechanistic analysis shows this appears discontinuous externally but progresses continuously, triggered by memorization capacity saturation.
When do language models stop memorizing and start generalizing? GPT-family models have a measurable memorization capacity of approximately 3.6 bits per parameter. When this capacity fills, a phase transition triggers grokking, the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Can imitating ChatGPT fool evaluators into thinking models improved? Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method: better fundamentals, not shortcuts, drive real improvement.
Do reasoning traces need to be semantically correct? Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.