SYNTHESIS NOTE

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? If they do, the relationship between training and learning needs to be reframed.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Nested Learning (NL) proposes that every component of a neural network, optimizers included, is an associative memory system that compresses its own context flow. This is a formal claim rather than a metaphor: given keys K and values V, an associative memory is an operator M: K → V, and optimizing M (minimizing a loss over the mapping) is the learning process.
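The definition above can be illustrated with the simplest possible case. This is a minimal sketch, assuming a linear memory M and a mean-squared-error loss; neither choice comes from the note, and the function name `fit_linear_memory` is hypothetical. It only shows that "learning" here means optimizing the key-to-value mapping.

```python
import numpy as np

def fit_linear_memory(keys, values, lr=0.1, steps=5000):
    """Fit a matrix M so that keys @ M approximates values.

    Assumption: a linear memory trained by plain gradient descent on
    0.5 * mean squared error. This is an illustration of M: K -> V,
    not the formulation used in the Nested Learning paper.
    """
    n, d_k = keys.shape
    d_v = values.shape[1]
    M = np.zeros((d_k, d_v))
    for _ in range(steps):
        residual = keys @ M - values   # recall error for every stored pair
        grad = keys.T @ residual / n   # gradient of the loss with respect to M
        M -= lr * grad                 # the update is the "memory write"
    return M

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))    # 8 keys of dimension 16
values = rng.normal(size=(8, 4))   # 8 values of dimension 4
M = fit_linear_memory(keys, values)
recall_error = np.max(np.abs(keys @ M - values))
```

With fewer stored pairs than key dimensions, the memory can recall every value almost exactly; `recall_error` measures how far the recalled values are from the stored ones.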

The key insight: gradient-based optimizers such as Adam and SGD with momentum are themselves associative memory modules that compress the gradient context. Adam's running averages compress the history of gradients, so the optimizer does the same kind of thing as the network's layers: it learns a useful representation of its input stream. The picture is self-referential: the optimizer that trains the network is itself a memory system, trained on the stream of gradients that the data produces.
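One way to make the "running average as memory" reading concrete is an elementary identity: an exponential moving average of gradients is exactly one gradient-descent step, with step size 1 - beta, on the loss 0.5 * ||m - g||^2 between the memory m and the incoming gradient g. The sketch below checks this numerically. It is an illustration only: it omits Adam's bias correction and second-moment estimate, the function names are hypothetical, and the inner objective is an assumption that may differ from the one used in the Nested Learning paper.

```python
import numpy as np

def ema_update(m, g, beta):
    """Adam-style first-moment update: exponential moving average of gradients."""
    return beta * m + (1.0 - beta) * g

def memory_gd_update(m, g, beta):
    """One gradient step of size (1 - beta) on the loss 0.5 * ||m - g||^2."""
    grad_wrt_m = m - g                      # gradient of the loss with respect to m
    return m - (1.0 - beta) * grad_wrt_m

rng = np.random.default_rng(0)
gradient_stream = rng.normal(size=(100, 5))  # stand-in for a history of gradients
beta = 0.9

m_ema = np.zeros(5)
m_gd = np.zeros(5)
for g in gradient_stream:
    m_ema = ema_update(m_ema, g, beta)
    m_gd = memory_gd_update(m_gd, g, beta)

# The two trajectories agree up to floating-point rounding.
max_gap = np.max(np.abs(m_ema - m_gd))
```

Read this way, the momentum buffer is a small memory being fitted online to the gradient stream, which is the sense in which the optimizer "learns" from its input.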

From the neuropsychology literature: "Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory." Under this definition, any parameter update from gradient descent (at any level of the system) is a memory operation. This dissolves the artificial boundary between "the model" and "the training process" — both are memory systems at different nesting levels.

The practical implication is a new architectural dimension. Stacking more layers (depth) can hit diminishing returns for several reasons: the model's computational depth may not increase, capacity may improve only marginally, training may converge to a suboptimal solution, and the ability to adapt or learn continually does not improve. NL suggests adding more nesting levels (nested optimization problems) as a dimension orthogonal to depth.

This yields concrete architectures:

Self-Modifying Titans: a sequence model that learns how to modify itself by learning its own update algorithm.
HOPE: combines a self-modifying sequence model with a continuum memory system, in which memories are stored at different frequencies/timescales to make memory management robust against catastrophic forgetting.
Deep optimizers: more expressive optimizers with deep memory and/or more powerful learning rules, going beyond Adam and SGD.

The continuum memory system is particularly interesting: it generalizes the traditional short-term/long-term memory distinction into a continuous spectrum. Memory is distributed throughout all parameters and stored at different timescales, without isolated blocks. NL presents this as mirroring brain organization: distributed, interconnected memory without clearly independent components for different time horizons.
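The idea of memory stored at different timescales can be sketched as a set of states that are updated at different frequencies. The class below is a toy illustration and not the HOPE architecture: the name `MultiTimescaleMemory`, the use of plain vectors as memory, the update periods, and the averaging rule are all assumptions made for this example.

```python
import numpy as np

class MultiTimescaleMemory:
    """Toy memory with one state vector per level.

    Level i is updated only every periods[i] steps, so levels with
    longer periods change more slowly and retain older context.
    """

    def __init__(self, dim, periods, lr=0.5):
        self.periods = list(periods)
        self.lr = lr
        self.states = [np.zeros(dim) for _ in self.periods]
        self.buffers = [np.zeros(dim) for _ in self.periods]
        self.update_counts = [0 for _ in self.periods]
        self.step_count = 0

    def step(self, x):
        """Feed one input; each level updates only when its period elapses."""
        self.step_count += 1
        for i, period in enumerate(self.periods):
            self.buffers[i] += x                      # accumulate this level's context
            if self.step_count % period == 0:
                target = self.buffers[i] / period     # summary of the last chunk
                self.states[i] += self.lr * (target - self.states[i])
                self.buffers[i][:] = 0.0              # start a new chunk
                self.update_counts[i] += 1

memory = MultiTimescaleMemory(dim=3, periods=[1, 4, 16])
rng = np.random.default_rng(0)
for _ in range(64):
    memory.step(rng.normal(size=3))
```

After 64 inputs the fastest level has been written 64 times and the slowest only 4 times, which gives a spectrum of timescales rather than a binary short-term/long-term split.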

The unifying reframe comes from the full paper, "Nested Learning: The Illusion of Deep Learning Architectures" (https://arxiv.org/abs/2512.24695). The titular "illusion" is that what we call deep learning is really a set of nested, multi-level optimization problems, each with its own context flow and update frequency. The load-bearing consequence is unification. Pre-training, in-context learning, and continual learning are recast as manifestations of one underlying mechanism: learning to compress and reuse context at different levels and timescales. Backpropagation, momentum, and preconditioning are themselves associative-memory mechanisms, specific design points in a larger, previously hidden space. Depth is therefore not the only axis of expressivity: the number of nesting levels (and their update frequencies) is an orthogonal one, and what additional levels buy is higher-order in-context learning.

Inquiring lines that use this note as a source 1

Why is consolidation quality the binding constraint in neural memory systems?

Related concepts in this collection 5

Can neural memory modules scale language models beyond attention limits? Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans' memory distinction is a special case of NL's continuum memory system
When should AI systems do their thinking? Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
NL suggests the timing question extends to memory: when to consolidate, at what timescale
What happens inside models when they suddenly generalize? Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
Grokking phases may correspond to transitions between nesting levels
Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
NL questions whether this localization is fundamental or an artifact of single-level training
Can text-trained models compress images better than specialized tools? Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
NL operationalizes the compression=generalization principle at the component level: if every NN component is an associative memory module compressing its own context flow, then the compression-as-generalization equivalence applies not just to the whole model but to each optimizer, layer, and memory system independently
