TOPIC

LLM Memory

26 synthesis notes · 56 source papers

Can LLMs read long documents like humans do?

How might mimicking human reading strategies—storing gist memories and looking up details on demand—help language models handle documents beyond their effective context window?

Is agent memory a storage problem or a connectivity problem?

Most systems treat memory as a repository to store and retrieve. But what if memory's real usefulness depends on how units are linked together rather than what is stored?

How should agents decide what memories to keep?

Agent memory management splits between two approaches: agents autonomously recognizing important information, and programmatic triggers. Understanding this choice reveals why different memory architectures prioritize different information types.

Should agent memory adapt dynamically based on execution feedback?

Can agents improve performance by continuously reshaping memory connections in response to whether tasks succeed or fail, rather than relying on fixed retrieval pipelines? This matters because static memory degrades in changing environments.

Can three axes replace the short-term/long-term memory split?

Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.

How should agent memory split across time scales?

Explores whether agent working memory should be organized by temporal scope—some components persisting across a conversation, others refreshed each turn. Understanding this distinction could reveal why some memory designs fail.

Can retrieval knowledge compress into a tiny parametric model?

Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.

Can a single model replace retrieval for long-term conversation memory?

COMEDY proposes collapsing the standard retrieval pipeline into one unified model that generates, compresses, and responds. But does eliminating the retriever actually improve performance, or does compression lose critical information?

Can lookup memory and computation work together better than either alone?

Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?

Can models consolidate memories during offline sleep phases?

This explores whether LLMs can use dedicated offline periods to consolidate short-term learning into permanent weights, avoiding catastrophic forgetting and the need for expensive retraining.

Does agent memory degrade when continuously consolidated?

Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.

Why do time-based queries fail in conversational retrieval systems?

Conversational memory systems struggle with questions that reference when something was discussed rather than what was said. Standard vector databases lack temporal indexing to retrieve by metadata like date, speaker, or session order.

Can agents learn better from their failures than successes?

Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.

Can brain memory systems explain how LLMs should store knowledge?

This explores whether the brain's three-tier memory architecture—neocortex, hippocampus, and prefrontal cortex—maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.

When do language models stop memorizing and start generalizing?

Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.

Has memory architecture replaced parameter count as the scaling frontier?

Late-2025 research suggests the field's next major efficiency gains come from restructuring how models store and use experience rather than simply making them larger. Three convergent signals point to this shift.

Can agents learn continuously from experience without updating weights?

This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.

Can agents fail from weak memory control rather than missing knowledge?

As multi-turn agent workflows grow longer, performance degrades—but is this due to insufficient context or poor memory management? This explores whether memory *control* is the real bottleneck.

Can agents learn preferences by watching rather than asking?

Explores whether multimodal agents can build accurate preference models through continuous observation of user behavior, without explicit instruction, by organizing memory around entities and separating concrete events from derived knowledge.

Where does a model store memorized paragraphs?

Can we pinpoint the specific layers, attention heads, and tokens where language models localize verbatim memorization? Understanding this spatial signature could enable targeted unlearning.

Can storing evolved thoughts prevent inconsistent reasoning in conversations?

When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Source papers 56

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Current Large Language Models (LLMs) are not only limited to some maximum context length, but also are not able to robustly consume long inputs. To address these limitations, we propose ReadAgent, an …
AI Agent Traps
As autonomous AI agents increasingly navigate the web, they face a novel challenge: the information environment itself. This gives rise to a critical vulnerability we refer to as "AI Agent Traps" — ad…
AI Agents Need Memory Control Over More Context
AI agents are increasingly used in long, multi-turn workflows in both research and enterprise settings. As interactions grow, agent behavior often degrades due to loss of constraint focus, error accum…
Agent Workflow Memory
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contr…
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either …
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation—modifying inputs with instructions, strategies, or evidence, rather than we…
Artifacts as Memory Beyond the Agent Boundary
The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent’s active use of environmental resources. Here, we begin formalizing this intuition w…
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and …
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs…
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given th…
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations
Maintaining long-term conversations has always been a long-term pursuit in current open-domain dialogue systems (Liu et al., 2016; Zhang et al., 2018; Kann et al., 2022; Song et al., 2023), commonly k…
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computat…
Conversation Chronicles: Towards Diverse Temporal and Relational Dynamics in Multi-Session Conversations
In the field of natural language processing, open-domain chatbots have emerged as an important research topic. However, a major limitation of existing open-domain chatbot research is its singular focu…
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue…
Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources
CGMI: Configurable General Multi-Agent Interaction Framework (https://arxiv.org/abs/2308.12503) “With the capabilities of large …
Efficient Nearest Neighbor Language Models
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore, which allows them to learn through explicitly memorizing the training datapoints. W…
Evaluating Very Long-Term Conversational Memory of LLM Agents
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning n…
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover in…
From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models
RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. It enta…
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation model…
Generalization through Memorization: Nearest Neighbor Language Models
We introduce kNN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a k-nearest neighbors (kNN) model. The nearest neighbors are computed according to distanc…
Generative Agents: Interactive Simulacra of Human Behavior
“Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, …
Hierarchical Reasoning Model
Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT…
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involv…
How much do language models memorize?
We propose a new method for estimating how much a model “knows” about a datapoint and use it to measure the capacity of modern language models. We formally separate memorization into two components: u…
It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the…
Language Models Need Sleep
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consol…
Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
The past few decades have witnessed significant advances in the design of machine learning algorithms–from early studies on task-specific shallow models to more general deep Large Language Models (LLM…
Learning to Relate to Previous Turns in Conversational Search
As in any conversation in natural language, queries in conversational search may involve omissions, references to previous turns, and ambiguities [32]. Thus, a primary challenge for effective conversa…
Learning to Select the Relevant History Turns in Conversational Question Answering
The increasing demand for web-based digital assistants has led to a rapid rise in the interest of the Information Retrieval (IR) community in the field of conversational question answeri…
Localizing Paragraph Memorization in Language Models
Can we localize the weights and mechanisms used by a language model to memorize and recite entire paragraphs of its training data? In this paper, we show that while memorization is spread across multi…
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability,…
Making Sense of Memory in AI Agents
However, there’s also another approach to categorizing memory types for AI agents from a design pattern perspective. Sarah Wooders from Letta argues that an LLM is a tokens-in-tokens-out function, not…
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Large Language Models (LLMs) employ autoregressive decoding that requires sequential computation, with each step reliant on the previous one’s output. This creates a bottleneck as each step necessitat…
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation
We propose MemoChat, a pipeline for refining instructions that enables large language models (LLMs) to effectively employ self-composed memos…
Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) re…
Memory in the Age of AI Agents: A Survey — Forms, Functions and Dynamics
Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. It underpins long-horizon reasoning, continual adaptation, and effective interaction with complex e…
Multi-Token Attention
Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and …
Nested Learning: The Illusion of Deep Learning Architecture Expanded
In the previous sections, we discussed the concept of nested learning and how existing well-known components of neural networks such as popular optimizers and architectures fall under the NL paradigm.…
Nested Learning: The Illusion of Deep Learning Architectures
Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance th…
On the Limits of Innate Planning in Large Language Models
Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code exec…
PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unifi…
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
The capabilities and limitations of Large Language Models (LLMs) have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstr…
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from…
Rethinking Memory as Continuously Evolving Connectivity
Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where fe…
See you soon again, chatbot? A design taxonomy to characterize user-chatbot relationships with different time horizons
Users interact with chatbots for various purposes and motivations – and for different periods of time. However, since chatbots are considered social actors and given that time is an essential componen…
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Imagine that in the future a household robot can autonomously carry out household tasks without your explicit instructions; it must have learned the operational rules of your home through daily experi…
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily sto…
The AI Hippocampus: How Far are We From Human Memory?
Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs). As these models transition from…
The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?
Humans have a selective memory, remembering relevant episodes and forgetting the less relevant information. Possessing awareness of event memorability for a user could help intelligent system…
Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable performance in long-term human-machine interactions, which basically relies on iterative recalling and reasoning of history t…
Titans: Learning to Memorize at Test Time
For more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory…
Toward Conversational Agents with Context and Time Sensitive Long-term Memory
There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until r…
Toward Efficient Agents: A Survey of Memory, Tool Learning, and Planning
Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency — which is crucial for r…
Useful Memories Become Faulty When Continuously Updated by LLMs
Learning from past experience benefits from two complementary forms of memory: episodic traces—raw trajectories of what happened—and consolidated abstractions distilled across many episodes into reusa…