Can models learn working memory by attending to their own latents?
Can a feedback loop that lets transformers attend to their own internal representations enable them to process indefinitely long sequences without adding extra weights? This note explores whether working memory can emerge from self-attention rather than from external modules.
Transformers' quadratic attention cost caps how much they can process at once, and they suffer from "anterograde amnesia": vast long-term memory stored in weights, but short-term memory bounded by the attention window. TransformerFAM (Feedback Attention Memory) adds a feedback loop that lets the network attend to its own latent representations, fostering the emergence of working memory and enabling processing of indefinitely long sequences. It has two practical virtues: it requires no additional weights (so it integrates seamlessly with pretrained models), and it improves long-context performance at the 1B, 8B, and 24B scales.
The idea worth keeping is the reframing of memory as feedback over the model's own latents rather than a bolted-on external store: working memory emerges from the architecture attending to itself, and because the mechanism adds no weights, existing pretrained models can be retrofitted.
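To make the feedback idea concrete, here is a minimal single-layer NumPy sketch. It is an illustration under stated assumptions, not the TransformerFAM implementation: the function names (`attend`, `feedback_attention`), the zero-initialised memory, the block and memory sizes, and the omission of causal masking, multiple heads, and cached keys from previous blocks are all simplifications introduced here. What it does show is the core loop described above: each block attends to a small memory plus itself, and the memory is then updated by attending to the same context with the same projection matrices, so no extra weights are introduced.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context, w_q, w_k, w_v):
    # Plain scaled dot-product attention (no causal mask, for brevity).
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def feedback_attention(x, w_q, w_k, w_v, block_len=4, mem_len=2):
    """Process x block by block with a feedback memory (hypothetical sketch).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_model).
    """
    d_model = x.shape[-1]
    # Assumption: memory starts at zero; the source does not specify this.
    memory = np.zeros((mem_len, d_model))
    outputs = []
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        # Context size is bounded by mem_len + block_len, whatever seq_len is.
        context = np.concatenate([memory, block], axis=0)
        # The block reads from the memory and from itself.
        outputs.append(attend(block, context, w_q, w_k, w_v))
        # Feedback: the memory queries the same context with the SAME weights.
        memory = attend(memory, context, w_q, w_k, w_v)
    return np.concatenate(outputs, axis=0), memory

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=(10, d))
out, mem = feedback_attention(x, w_q, w_k, w_v)
print(out.shape, mem.shape)
```

In this sketch the per-step attention cost depends only on `block_len + mem_len`, so the sequence can grow without the attention matrix growing; information from earlier blocks reaches later ones only through `memory`.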
This sits in the vault's long-context/memory cluster as a weight-free, feedback-based route. It complements "Can neural memory modules scale language models beyond attention limits?" (Titans adds a memory module) and "Can recurrent memory scale where attention fails on ultra-long text?" (recurrent state), and it shares the attend-to-own-latents mechanism with looped/recurrent architectures such as the one in "Can reasoning happen in latent space during pretraining?".
Inquiring lines that use this note as a source 12
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can predictive self-supervision work on unlabeled sequential visual data?
- Why does externalizing bookkeeping raise effective feedback compute?
- Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?
- How do adaptive memory modules compare to feedback-based working memory for long context?
- What makes looped latent computation more efficient than scaling attention capacity?
- Why does attending to own latents work better than bolted-on external memory stores?
- Can recurrent transformers learn genuinely new computations beyond inference stages?
- Why does attention concentrate on the first 25% of long input sequences?
- Can adaptive memory modules combine long-term filtering with short-term attention benefits?
- Does attention linearity alone explain the efficiency gains over standard transformers?
- Why do hybrid attention architectures outperform pure linear attention models?
- How do recurrent memory systems handle ultra-long context differently than attention?
Related concepts in this collection 3
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Titans adds a memory module; FAM induces working memory via feedback with no extra weights
- Can recurrent memory scale where attention fails on ultra-long text?
  GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
  sibling long-context route via recurrent state
- Can reasoning happen in latent space during pretraining?
  Does building iterative computation into pretraining rather than deferring reasoning to post-training actually improve how language models manipulate knowledge? And what would that tell us about where thinking happens?
  shares the attend-to-own-latents mechanism
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- TransformerFAM: Feedback attention is working memory
- In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- A Mechanistic Analysis of Looped Reasoning Language Models
- Titans: Learning to Memorize at Test Time
- Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
- Hierarchical Reasoning Model
- Nested Learning: The Illusion of Deep Learning Architectures
Original note title
feedback attention to a model's own latents fosters working memory for unbounded sequences without extra weights