SYNTHESIS NOTE
Model Architecture and Internals

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Nested Learning (NL) proposes that every component of a neural network — including optimizers — is an associative memory system that compresses its own context flow. This is not a metaphor but a formal claim: given keys K and values V, associative memory is an operator M: K → V, and the optimization of M (minimizing a loss over the mapping) is the learning process.
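To make the formal claim concrete, here is a minimal sketch of an associative memory as a learned operator. It assumes a linear map M, a squared-error loss, and orthonormal keys, all chosen here for illustration rather than taken from the paper. The function name train_linear_memory and the sizes are hypothetical.

```python
import numpy as np

def train_linear_memory(keys, values, lr=0.5, steps=100):
    """Fit a linear memory M so that M @ k approximates v for each (k, v) pair.

    Minimizes 0.5 * ||keys @ M.T - values||^2 by plain gradient descent.
    The linear form and the squared-error loss are illustrative assumptions.
    """
    d_k = keys.shape[1]
    d_v = values.shape[1]
    M = np.zeros((d_v, d_k))
    for _ in range(steps):
        err = keys @ M.T - values   # recall error, shape (n, d_v)
        grad = err.T @ keys         # gradient of the loss w.r.t. M, shape (d_v, d_k)
        M -= lr * grad              # "learning" = optimizing the mapping
    return M

rng = np.random.default_rng(0)
# Orthonormal keys (an assumption) make recall exact and convergence easy to see.
q, _ = np.linalg.qr(rng.normal(size=(8, 4)))
keys = q.T                          # 4 keys of dimension 8
values = rng.normal(size=(4, 3))    # 4 values of dimension 3

M = train_linear_memory(keys, values)
recalled = keys @ M.T               # query the memory with the stored keys
max_err = float(np.max(np.abs(recalled - values)))
```

In this toy setting the memory M is the set of parameters and "learning" is the loop that reduces the recall error, which is the sense in which optimizing the mapping is the learning process.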

The key insight: gradient-based optimizers like Adam and SGD with Momentum are themselves associative memory modules that aim to compress the gradient context. When you view Adam's running averages as a memory system compressing the history of gradients, the optimizer is doing the same thing as the neural network layers — learning a useful representation of its input stream. This is self-referential: the optimizer that trains the network is itself a memory system being trained by the data.
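One way to see the optimizer-as-memory view in code: an exponential moving average of gradients (the form of Adam's first-moment estimate, without bias correction) is exactly one gradient descent step on a small inner objective that pulls the memory toward the newest gradient. The inner objective 0.5 * ||m - g||^2 is an assumption picked for this sketch and is not claimed to be the paper's formulation. The function names are hypothetical.

```python
import numpy as np

def ema_update(m, g, beta):
    """Standard running average of gradients, as used for momentum-style state."""
    return beta * m + (1.0 - beta) * g

def memory_gd_step(m, g, beta):
    """One gradient descent step on the inner loss 0.5 * ||m - g||^2.

    The step size (1 - beta) is chosen so the result matches the EMA update.
    """
    inner_grad = m - g                      # gradient of the inner loss w.r.t. m
    return m - (1.0 - beta) * inner_grad

rng = np.random.default_rng(0)
m_ema = np.zeros(5)
m_gd = np.zeros(5)
for _ in range(50):
    g = rng.normal(size=5)                  # stand-in for a gradient from the outer problem
    m_ema = ema_update(m_ema, g, beta=0.9)
    m_gd = memory_gd_step(m_gd, g, beta=0.9)

# The two update rules agree up to floating-point rounding.
max_gap = float(np.max(np.abs(m_ema - m_gd)))
```

Read this way, the optimizer state m is a memory whose "training data" is the stream of gradients, and the decay factor beta sets how aggressively old context is overwritten.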

From the neuropsychology literature: "Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory." Under this definition, any parameter update from gradient descent (at any level of the system) is a memory operation. This dissolves the artificial boundary between "the model" and "the training process" — both are memory systems at different nesting levels.

The practical implication is a new architectural dimension. Stacking more layers (depth) has diminishing returns for several reasons: computational depth may not increase, capacity shows marginal improvement, training may converge suboptimally, and adaptation/continual learning ability doesn't improve. NL suggests adding more nesting levels — nested optimization problems — as an orthogonal dimension to depth.

This yields concrete architectures:

The continuum memory system is particularly interesting: it generalizes the traditional short-term/long-term memory distinction into a continuous spectrum. Memory is distributed throughout all parameters, stored at different timescales, without isolated blocks. This mirrors brain organization — distributed interconnected memory without clear independent components for different time horizons.
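As a toy illustration of memory stored at different timescales, the sketch below keeps several memory states that all see the same input stream but update at different periods. This is not the paper's architecture: the function run_multi_timescale, the periods (1, 4, 16), and the averaging rule are all assumptions made for this example.

```python
import numpy as np

def run_multi_timescale(stream, periods=(1, 4, 16), lr=0.5):
    """Update one memory state per level, each at its own period.

    Level i buffers inputs and, every periods[i] steps, moves its state
    toward the mean of the buffered chunk. Fast levels track recent input;
    slow levels change rarely and summarize longer spans.
    """
    dim = stream.shape[1]
    states = [np.zeros(dim) for _ in periods]
    buffers = [np.zeros(dim) for _ in periods]
    updates = [0 for _ in periods]
    for t, x in enumerate(stream, start=1):
        for i, period in enumerate(periods):
            buffers[i] += x
            if t % period == 0:
                target = buffers[i] / period              # mean of the last chunk
                states[i] += lr * (target - states[i])    # partial move toward it
                buffers[i][:] = 0.0
                updates[i] += 1
    return states, updates

stream = np.ones((32, 2))       # constant input, so every level should drift toward 1
states, updates = run_multi_timescale(stream)
```

On this 32-step stream the levels update 32, 8, and 2 times respectively, so the slowest level has only moved to 0.75 while the fastest is essentially at 1. Adding more periods between the extremes is the sense in which a short-term/long-term split becomes a spectrum.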

The unifying reframe comes from the full paper, "Nested Learning: The Illusion of Deep Learning Architectures" (https://arxiv.org/abs/2512.24695). The titular "illusion" is that what we call deep learning is really a set of nested, multi-level optimization problems, each with its own context flow and update frequency. The load-bearing consequence is unification: pre-training, in-context learning, and continual learning are recast as manifestations of one underlying mechanism, namely learning to compress and reuse context at different levels and timescales. Backpropagation, momentum, and preconditioning are themselves associative-memory mechanisms, specific design points in a larger, previously hidden space. So depth is not the only axis of expressivity: the number of nesting levels (and their update frequencies) is an orthogonal one, and higher-order in-context learning is what more levels buy.

Original note title

all neural network components including optimizers are associative memory modules compressing their own context flow