SYNTHESIS NOTE

Can models learn working memory by attending to their own latents?

Can a feedback loop that lets a transformer attend to its own internal representations enable it to process indefinitely long sequences without adding extra weights? This note explores whether working memory can emerge from self-attention rather than from external modules.

Synthesis note · 2026-06-03 · sourced from LLM Architecture

Transformer attention scales quadratically with sequence length, which caps how much context a model can process at once. The result is a kind of "anterograde amnesia": long-term memory in the weights is vast, but short-term memory is bounded by the attention window. TransformerFAM (Feedback Attention Memory) adds a feedback loop that lets the network attend to its own latent representations, which fosters the emergence of working memory and enables processing of indefinitely long sequences. Two practical virtues stand out: it requires no additional weights, so it integrates seamlessly with pretrained models, and it improves long-context performance at the 1B, 8B, and 24B scales.
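The feedback loop described above can be illustrated with a toy sketch. This is not the TransformerFAM implementation: it assumes a single layer, a single head, and identity query/key/value projections (so there are visibly no extra weights), and the names `attend`, `fam_forward`, `block_size`, and `fam_init` are hypothetical. It only shows the shape of the idea: tokens are processed block by block, each token attends to a small fixed-size memory plus its own block, and the memory is then updated by attending to that block and to its own previous state.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for one query vector (single head).
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def fam_forward(tokens, block_size, fam_init):
    # Assumption: identity projections, one layer. Memory size never grows.
    fam = [list(v) for v in fam_init]
    outputs = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # Each token sees the previous memory plus the causally visible
        # part of its own block, never the full history.
        for i, tok in enumerate(block):
            context = fam + block[:i + 1]
            outputs.append(attend(tok, context, context))
        # Feedback step: memory queries attend to the current block and to
        # the previous memory, compressing the block into the memory slots.
        context = block + fam
        fam = [attend(m, context, context) for m in fam]
    return outputs, fam

# Toy input: 12 two-dimensional "latents" and two memory slots (hypothetical sizes).
tokens = [[math.sin(0.1 * t), math.cos(0.1 * t)] for t in range(12)]
fam_init = [[0.0, 0.0], [0.0, 0.0]]
outputs, fam = fam_forward(tokens, block_size=4, fam_init=fam_init)
print(len(outputs), len(fam))  # prints: 12 2
```

In this sketch the per-token attention cost depends on `block_size` and the number of memory slots rather than on total sequence length, which is the property that lets the sequence grow without bound while the carried state stays fixed.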

The key takeaway is the reframing of memory as feedback over the model's own latents rather than a bolted-on external store: working memory emerges from the architecture attending to itself, and because the mechanism adds no weights, existing models can be retrofitted.

This sits in the vault's long-context/memory cluster as a weight-free, feedback-based route. It complements "Can neural memory modules scale language models beyond attention limits?" (Titans adds a memory module) and "Can recurrent memory scale where attention fails on ultra-long text?" (recurrent state), and it shares the attend-to-own-latents mechanism with looped/recurrent architectures (see "Can reasoning happen in latent space during pretraining?").

Related concepts in this collection (3)

Can neural memory modules scale language models beyond attention limits?
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans adds a memory module; FAM induces working memory via feedback with no extra weights

Can recurrent memory scale where attention fails on ultra-long text?
GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
sibling long-context route via recurrent state

Can reasoning happen in latent space during pretraining?
Does building iterative computation into pretraining rather than deferring reasoning to post-training actually improve how language models manipulate knowledge? And what would that tell us about where thinking happens?
shares the attend-to-own-latents mechanism

Related papers in this collection (8)

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

TransformerFAM: Feedback attention is working memory (0.90 match)
In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss (0.86 match)
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (0.84 match)
A Mechanistic Analysis of Looped Reasoning Language Models (0.83 match)
Titans: Learning to Memorize at Test Time (0.83 match)
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers (0.82 match)
Hierarchical Reasoning Model (0.82 match)
Nested Learning: The Illusion of Deep Learning Architectures (0.82 match)

Original note title

feedback attention to a model's own latents fosters working memory for unbounded sequences without extra weights