What is the theoretical capacity limit before memorization saturates?
This explores whether there's a hard ceiling on how much a model can memorize before it saturates — and what the corpus says about that limit, what sets it, and what happens once it's reached.
The corpus gives a surprisingly precise answer: GPT-family models top out at roughly **3.6 bits per parameter** (see "When do language models stop memorizing and start generalizing?"). That figure is not a quirk of a particular training algorithm; it is a property of the model itself. A model's memory budget scales with its parameter count, and you can measure how full it is.
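To make the budget concrete, here is a minimal arithmetic sketch. It assumes the roughly 3.6 bits-per-parameter figure quoted above applies uniformly; the function names, the example parameter counts, and the idea of expressing a dataset as a single `dataset_bits` number are illustrative assumptions, not something taken from the cited notes.

```python
# Approximate figure quoted above for GPT-family models (an estimate, not a constant of nature).
BITS_PER_PARAM = 3.6


def memorization_capacity_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated in-weight memorization budget, in bits."""
    return num_params * bits_per_param


def fill_ratio(dataset_bits: float, num_params: int) -> float:
    """Fraction of the budget a dataset would occupy, capped at 1.0 (saturated).

    `dataset_bits` is a hypothetical measure of how much information the
    training data would cost to memorize outright.
    """
    return min(1.0, dataset_bits / memorization_capacity_bits(num_params))


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show the scale of the numbers.
    for n in (125_000_000, 1_500_000_000):
        megabytes = memorization_capacity_bits(n) / 8 / 1e6
        print(f"{n:,} parameters: about {megabytes:.0f} MB of memorization budget")
```

Under this estimate the budget grows linearly with parameter count, so a model ten times larger can memorize roughly ten times as much before it saturates.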
The interesting part is what happens when that budget fills. Rather than degrading, the model undergoes a phase transition: it stops cramming examples into its weights and starts *generalizing*, the phenomenon called grokking (see "When do language models stop memorizing and start generalizing?"). So 'saturation' is not a failure point; it is the threshold at which generalization has to take over because rote storage has run out of room. A related study adds that memorization and reasoning coexist as separable factors rather than one replacing the other: chain-of-thought accuracy decomposes into output probability, memorized pattern-matching, and noisy genuine reasoning, all operating at once (see "What three separate factors drive chain-of-thought performance?").
The deeper lesson is that the limit only bites if you insist on storing facts *in the weights*. A formal proof, backed by experiments, shows that in-weight recall is bounded by model size. Tool use sidesteps the ceiling entirely, giving unbounded factual recall through a simple lookup circuit, while cramming facts in via fine-tuning actively overwrites prior knowledge (see "Can models store unlimited facts without growing larger?"). In other words, the 3.6-bit wall is a wall for one storage strategy, not for the system. Architectures like Titans make a similar move internally, splitting short-term attention from a compressed long-term memory that prioritizes surprising tokens for storage (see "Can neural memory modules scale language models beyond attention limits?").
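The contrast between the two storage strategies can be sketched as a toy model. This is an analogy under loud assumptions, not the construction from the cited proof: `BoundedWeightStore` and `ToolLookup` are hypothetical classes, each fact is given an arbitrary bit cost, and oldest-first eviction merely stands in for fine-tuning overwriting prior knowledge (real overwriting is diffuse, not first-in-first-out).

```python
from collections import OrderedDict


class BoundedWeightStore:
    """Toy stand-in for in-weight recall: a fixed bit budget set by parameter count."""

    def __init__(self, num_params: int, bits_per_param: float = 3.6):
        self.capacity_bits = num_params * bits_per_param
        self.used_bits = 0.0
        self.facts = OrderedDict()  # key -> (value, cost_bits), oldest first

    def write(self, key, value, cost_bits: float) -> None:
        if cost_bits > self.capacity_bits:
            raise ValueError("fact is larger than the whole budget")
        if key in self.facts:  # rewriting a fact frees its old cost first
            self.used_bits -= self.facts.pop(key)[1]
        # Assumption: when the budget is full, new facts displace the oldest ones.
        while self.used_bits + cost_bits > self.capacity_bits:
            _, (_, old_cost) = self.facts.popitem(last=False)
            self.used_bits -= old_cost
        self.facts[key] = (value, cost_bits)
        self.used_bits += cost_bits

    def recall(self, key):
        entry = self.facts.get(key)
        return entry[0] if entry is not None else None


class ToolLookup:
    """Toy stand-in for tool use: facts live outside the model, so no parameter budget applies."""

    def __init__(self):
        self.table = {}

    def write(self, key, value) -> None:
        self.table[key] = value

    def recall(self, key):
        return self.table.get(key)


if __name__ == "__main__":
    weights = BoundedWeightStore(num_params=10)  # 36-bit budget
    tool = ToolLookup()
    for i in range(3):
        weights.write(f"fact{i}", i, cost_bits=16)  # only two 16-bit facts fit
        tool.write(f"fact{i}", i)
    print(weights.recall("fact0"), tool.recall("fact0"))
```

In this sketch the third write pushes the first fact out of the bounded store, while the external table still returns it. The only point being illustrated is that the ceiling belongs to the storage strategy, not to the system as a whole.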
Where memorization physically lives also helps explain why the limit exists. In GPT-Neo, memorized paragraphs leave a fingerprint: larger gradients in lower layers and a specific low-layer attention head that attends to rare tokens (see "Where does a model store memorized paragraphs?"). Memorization therefore occupies specific, finite real estate that can be located and even targeted for unlearning. In reasoning chains, 'local' memorization based on the immediately preceding tokens dominates, accounting for up to 67% of reasoning errors as problems get harder (see "Where do memorization errors arise in chain-of-thought reasoning?").
The surprise worth leaving with: saturating the memory budget can be *good*. One RLVR study found that test accuracy kept climbing for 1,400 steps *after* training accuracy hit 100%, so generalization kept improving long past the point of memorization saturation, and all of it was unlocked by a single training example (see "Can a single training example unlock mathematical reasoning?"). The capacity limit is not where models break. It is where they are pushed to stop memorizing and start thinking.
Sources (7 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.