SYNTHESIS NOTE
Model Architecture and Internals

Can recurrent memory scale where attention fails on ultra-long text?

GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?

Synthesis note · 2026-06-03 · sourced from RAG

BABILong is a leak-proof benchmark for extracting and processing facts distributed across very long texts. Text length and fact placement are algorithmically adjustable, so future LLMs can't have memorized it. Two findings stand out. First, common methods, including GPT-4 and RAG, are effective only for sequences up to ~10⁴ elements, and their performance relies heavily on the first 25% of the input. That is a stark quantification of the lost-in-the-middle problem, where attention effectively ignores the bulk of a long context. Second, fine-tuning a small GPT-2 with recurrent memory augmentation lets it handle up to 11 million tokens, reported as by far the longest input processed by any neural model at the time, and crucially enables multi-hop reasoning by filtering irrelevant information rather than attending over everything.
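To make "length and placement are algorithmically adjustable" concrete, here is a minimal sketch of how a BABILong-style sample could be assembled. It is an illustration under stated assumptions, not the benchmark's actual generator: `build_sample`, its parameters, the filler sentences, and the use of whitespace words as a stand-in for tokens are all hypothetical.

```python
import random

def build_sample(facts, filler_sentences, target_words, depth_fractions, seed=0):
    """Scatter fact sentences into distractor text at chosen relative depths.

    facts:            fact sentences the question depends on
    filler_sentences: pool of irrelevant background sentences (assumed given)
    target_words:     approximate background length, in whitespace words
    depth_fractions:  one value in [0, 1] per fact (0 = start, 1 = end)
    """
    if len(facts) != len(depth_fractions):
        raise ValueError("need exactly one depth fraction per fact")
    rng = random.Random(seed)

    # Grow the background until it reaches the requested length.
    background, n_words = [], 0
    while n_words < target_words:
        sentence = rng.choice(filler_sentences)
        background.append(sentence)
        n_words += len(sentence.split())

    # Map each relative depth to a sentence index in the background.
    positions = [int(f * len(background)) for f in depth_fractions]

    # Merge: emit any facts scheduled for slot i, then background sentence i.
    out = []
    for i in range(len(background) + 1):
        out.extend(fact for fact, p in zip(facts, positions) if p == i)
        if i < len(background):
            out.append(background[i])
    return " ".join(out)
```

Sweeping `target_words` while holding `depth_fractions` fixed probes the length limit; sweeping `depth_fractions` at a fixed length probes the reliance on the start of the input.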

The keeper is the comparative claim: recurrent memory excels at filtering irrelevant content in a way that scaling attention does not. Where attention degrades and concentrates on the start of the input, a compact recurrent state forces the model to decide what to carry forward — and that selectivity is what unlocks ultra-long multi-hop reasoning.
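The control flow behind "decide what to carry forward" can be sketched as segment-by-segment reading with a fixed-size state. This is a toy under loud assumptions: the benchmark's model learns its memory update, whereas the hypothetical `read_with_memory` below uses a hand-written keyword filter as a stand-in, and a bounded `deque` of sentences in place of learned memory vectors.

```python
from collections import deque

def read_with_memory(sentences, keywords, segment_size=4, capacity=8):
    """Read a long input one segment at a time, carrying only a bounded memory.

    sentences:    the input, already split into sentences
    keywords:     lowercase words marking a sentence as relevant (assumption:
                  a stand-in for a learned relevance decision)
    segment_size: sentences visible per step
    capacity:     maximum number of sentences the memory may hold
    """
    memory = deque(maxlen=capacity)  # fixed-size state; oldest entry is evicted
    for start in range(0, len(sentences), segment_size):
        segment = sentences[start:start + segment_size]
        # Each step sees only (memory, current segment), never the full input.
        for sentence in segment:
            words = {w.strip(".,?!").lower() for w in sentence.split()}
            if words & keywords:
                memory.append(sentence)  # carried forward
            # otherwise the sentence is dropped and never revisited
    return list(memory)
```

The cost per step depends on `segment_size` and `capacity`, not on total input length, which is why this shape of computation can in principle keep going for millions of tokens. The trade-off is that anything filtered out is gone for good.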

This complements the vault's long-context thread from the memory side. It pairs with "Can neural memory modules scale language models beyond attention limits?" (Titans) as another recurrent-memory route past attention's limits, and it supplies the empirical lost-in-the-middle grounding for "How do LLMs balance remembering context versus keeping it separate?". It also sits in productive tension with "Can state-space models match transformers at copying and retrieval?": a fixed recurrent state is worse at verbatim copying yet better at filtering for multi-hop fact extraction, so the task profile decides which wins.

Original note title

recurrent memory augmentation processes eleven million tokens while LLMs and RAG rely on the first quarter of input — recurrent memory beats attention at filtering ultra-long context