Are neural network optimizers actually memory systems?
Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.
Nested Learning (NL) proposes that every component of a neural network, optimizers included, is an associative memory system that compresses its own context flow. This is a formal claim rather than a metaphor: given a set of keys K and a set of values V, an associative memory is an operator M: K → V, and learning is the optimization of M, that is, choosing M to minimize a loss that measures how well it maps keys to values.
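As a rough illustration of that definition, the sketch below fits a linear memory M to a handful of key/value pairs by gradient descent on a squared-error loss. The function names, the toy keys and values, the learning rate, and the choice of squared error are all assumptions made for this example. They are not taken from the paper, which allows more general objectives.

```python
# Hypothetical sketch: a linear associative memory M: K -> V,
# learned by minimizing 0.5 * ||M k - v||^2 over the stored pairs.

def matvec(M, k):
    """Apply the memory to a key: returns M @ k."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]

def train_memory(keys, values, lr=0.5, epochs=50):
    d_out, d_in = len(values[0]), len(keys[0])
    M = [[0.0] * d_in for _ in range(d_out)]  # empty memory
    for _ in range(epochs):
        for k, v in zip(keys, values):
            # Recall error for this pair: M k - v
            err = [p - t for p, t in zip(matvec(M, k), v)]
            # The gradient of the loss w.r.t. M is the outer product err k^T
            for i in range(d_out):
                for j in range(d_in):
                    M[i][j] -= lr * err[i] * k[j]
    return M

# Toy data (assumed): two orthonormal keys, arbitrary values
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, -1.0], [0.5, 3.0]]

M = train_memory(keys, values)
recalled = [matvec(M, k) for k in keys]
```

Here "learning" is nothing more than the loop that writes key/value associations into M, and "memory" is the resulting state of M, which is the reading of the definition that NL then applies to every component.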
The key insight: gradient-based optimizers like Adam and SGD with Momentum are themselves associative memory modules that aim to compress the gradient context. When you view Adam's running averages as a memory system compressing the history of gradients, the optimizer is doing the same thing as the neural network layers — learning a useful representation of its input stream. This is self-referential: the optimizer that trains the network is itself a memory system being trained by the data.
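One way to make this concrete is the exponential moving average that Adam uses for its first moment (bias correction omitted here). The sketch below relies on a standard identity: the update m ← βm + (1 − β)g is exactly one gradient-descent step, with step size (1 − β), on the inner loss ½(m − g)². The function names, gradient values, and β are assumptions for illustration, and this inner loss is one convenient choice rather than necessarily the objective the paper derives.

```python
# Hypothetical sketch: a running average of gradients viewed as a memory
# that is itself updated by gradient descent on an inner objective.

def ema_update(m, g, beta):
    """Usual running-average form (Adam's first moment, no bias correction)."""
    return beta * m + (1.0 - beta) * g

def inner_gd_step(m, g, beta):
    """Same update, written as one gradient step on 0.5 * (m - g)^2."""
    grad = m - g                      # derivative of the inner loss w.r.t. m
    return m - (1.0 - beta) * grad    # step size is (1 - beta)

def unrolled(grads, beta):
    """Closed form: the memory is a decaying weighted sum of all past gradients."""
    t = len(grads)
    return sum((1.0 - beta) * beta ** (t - 1 - i) * g for i, g in enumerate(grads))

grads = [0.8, -0.2, 0.5, 1.1, -0.4]   # assumed toy gradient stream
beta = 0.9

m_ema = m_gd = 0.0
for g in grads:
    m_ema = ema_update(m_ema, g, beta)
    m_gd = inner_gd_step(m_gd, g, beta)
```

Both loops produce the same state, and that state matches the unrolled weighted sum. In this sense the optimizer's buffer is a lossy, fixed-size summary of the whole gradient history.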
From the neuropsychology literature: "Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory." Under this definition, any parameter update from gradient descent (at any level of the system) is a memory operation. This dissolves the artificial boundary between "the model" and "the training process" — both are memory systems at different nesting levels.
The practical implication is a new architectural dimension. Stacking more layers (depth) has diminishing returns for several reasons: computational depth may not increase, capacity shows marginal improvement, training may converge suboptimally, and adaptation/continual learning ability doesn't improve. NL suggests adding more nesting levels — nested optimization problems — as an orthogonal dimension to depth.
This yields concrete architectures:
- Self-Modifying Titans: A sequence model that learns how to modify itself by learning its own update algorithm
- HOPE: Combines a self-modifying sequence model with a continuum memory system, in which memories are stored at different update frequencies (timescales), making memory management more robust against catastrophic forgetting
- Deep optimizers: More expressive optimizers with deep memory and/or more powerful learning rules, going beyond Adam/SGD
The continuum memory system is particularly interesting: it generalizes the traditional short-term/long-term memory distinction into a continuous spectrum. Memory is distributed throughout all parameters, stored at different timescales, without isolated blocks. This mirrors brain organization — distributed interconnected memory without clear independent components for different time horizons.
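The idea of memory held at several timescales can be sketched with a toy model in which each level sees the same input stream but only commits an update every `period` steps. The `MemoryLevel` class, the scalar states, the periods 1/4/16, and the averaging rule are all assumptions made for this example. The actual HOPE architecture uses learned neural blocks, which this sketch does not attempt to reproduce.

```python
# Hypothetical sketch: a spectrum of memories that update at different frequencies.

class MemoryLevel:
    def __init__(self, period, lr):
        self.period = period      # update once every `period` steps
        self.lr = lr              # how far each update moves the state
        self.state = 0.0          # what this level currently remembers
        self.buffer = []          # inputs seen since the last update
        self.num_updates = 0

    def observe(self, step, x):
        self.buffer.append(x)
        if step % self.period == 0:
            # Compress everything seen since the last update into one target
            target = sum(self.buffer) / len(self.buffer)
            self.state += self.lr * (target - self.state)
            self.buffer.clear()
            self.num_updates += 1

# Fast, medium, and slow levels (periods are assumed values)
levels = [MemoryLevel(1, 0.5), MemoryLevel(4, 0.5), MemoryLevel(16, 0.5)]

stream = [float(t % 5) for t in range(1, 17)]  # toy input stream, 16 steps
for step, x in enumerate(stream, start=1):
    for level in levels:
        level.observe(step, x)

update_counts = [level.num_updates for level in levels]
```

After 16 steps the fast level has been rewritten at every step, while the slow level has changed only once. In this toy, slowly updated levels are the ones that hold on to older context when the input distribution shifts.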
The unifying reframe comes from the full paper, "Nested Learning: The Illusion of Deep Learning Architectures" (https://arxiv.org/abs/2512.24695). The titular "illusion" is that what we call deep learning is really a set of nested, multi-level optimization problems, each with its own context flow and update frequency. The load-bearing consequence is unification: pre-training, in-context learning, and continual learning are recast as manifestations of one underlying mechanism (learning to compress and reuse context at different levels and timescales), and backpropagation, momentum, and preconditioning are themselves associative-memory mechanisms, specific design points in a larger, previously hidden space. So depth is not the only axis of expressivity: the number of nesting levels (and their update frequencies) is an orthogonal one, and what more levels buy is higher-order in-context learning.
Related concepts in this collection
- Can neural memory modules scale language models beyond attention limits?
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans' memory distinction is a special case of NL's continuum memory system
- When should AI systems do their thinking?
Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
NL suggests the timing question extends to memory: when to consolidate, at what timescale
- What happens inside models when they suddenly generalize?
Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
Grokking phases may correspond to transitions between nesting levels
- Why does reasoning training help math but hurt medical tasks?
Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
NL questions whether this localization is fundamental or an artifact of single-level training
- Can text-trained models compress images better than specialized tools?
Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
NL operationalizes the compression=generalization principle at the component level: if every NN component is an associative memory module compressing its own context flow, then the compression-as-generalization equivalence applies not just to the whole model but to each optimizer, layer, and memory system independently
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Nested Learning: The Illusion of Deep Learning Architectures
- Nested Learning: The Illusion of Deep Learning Architecture Expanded
- Titans: Learning to Memorize at Test Time
- Navigating the Latent Space Dynamics of Neural Models
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- The Vanishing Gradient Problem for Stiff Neural Differential Equations
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
Original note title
all neural network components including optimizers are associative memory modules compressing their own context flow