What persistent memory architectures best support storing precomputed inferences across sessions?

This explores how systems can hold onto *already-digested* reasoning — consolidated state, not raw transcripts — so a later session inherits the thinking instead of recomputing it, and which memory designs the corpus suggests actually pull that off.

The first surprise is that the corpus reframes the question: the hard part may not be storage at all. One line of work argues the real bottleneck is the *compute* needed to fold evicted context into a model's fast weights during an offline 'sleep' phase, and that performance keeps climbing with more consolidation passes (Is long-context bottleneck really about memory or compute?). If that's right, 'best architecture' partly means 'best place to spend consolidation compute,' not 'biggest cache.'

A cluster of designs answers by separating fast, short-term attention from a slower, compressed long-term store. Titans keeps attention for the immediate window but routes *surprising* tokens into a neural memory module, letting it carry state past two million tokens without paying the quadratic cost of attention (Can neural memory modules scale language models beyond attention limits?). The interesting design choice there is selectivity: it doesn't store everything, it stores what was unexpected, which is itself a kind of precomputed judgment about what's worth keeping.

For agents specifically, the corpus leans toward *structured schemas* rather than flat history. DeepAgent's memory folding autonomously compresses past interactions into episodic, working, and tool memories, cutting token overhead while preserving enough to let the agent pause and rethink strategy (Can agents compress their own memory without losing critical details?). AgentFly pushes this furthest: it treats learning itself as a memory operation, storing cases, subtasks, and tool outcomes so the agent improves across sessions *without ever updating weights*, reaching 87.88% on GAIA validation purely through memory reads and writes (Can agents learn continuously from experience without updating weights?). That's precomputed inference as a first-class substitute for retraining.
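A structured schema that survives a session boundary might look like the following. This is an illustrative sketch only: the `AgentMemory` class, its field layout, the JSON file format, and the example strings are assumptions made for the example, not DeepAgent's or AgentFly's actual data structures.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class AgentMemory:
    """Hypothetical three-part schema: episodic, working, and tool memory."""
    episodic: list = field(default_factory=list)  # compressed summaries of past episodes
    working: dict = field(default_factory=dict)   # current goal and open subgoals
    tool: dict = field(default_factory=dict)      # tool name -> list of observed outcomes

    def record_tool_outcome(self, tool_name: str, outcome: str) -> None:
        self.tool.setdefault(tool_name, []).append(outcome)

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)))

    @classmethod
    def load(cls, path: Path) -> "AgentMemory":
        return cls(**json.loads(path.read_text()))


# Session 1: the agent writes digested conclusions, not a raw transcript.
store = Path(tempfile.mkdtemp()) / "memory.json"
first = AgentMemory()
first.episodic.append("Search API rate-limits after 5 calls; batch queries instead.")
first.working["goal"] = "compile quarterly report"
first.record_tool_outcome("search", "429 after 5 rapid calls")
first.save(store)

# Session 2: a fresh process inherits the conclusions without recomputing them.
second = AgentMemory.load(store)
print(second.tool["search"])
```

The point of the split is that each part can be read, compressed, or discarded independently. A flat history would force the second session to re-derive the rate-limit lesson from raw logs.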

The retrieval-and-reasoning side adds a sharper requirement: a persistent workspace has to do more than recall; it has to *reconcile*. ComoRAG keeps a stateful memory workspace across retrieval cycles and uses it to detect and resolve contradictions, beating stateless multi-step retrieval by up to 11% on hard queries (Can reasoning systems maintain memory across retrieval cycles?). Stored inferences aren't inert; the value comes from a workspace that can notice when two stored conclusions disagree.
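What "noticing disagreement" requires of a store can be sketched with a workspace that refuses to silently overwrite a conflicting conclusion. This is a toy under stated assumptions: claims are keyed by a hypothetical `(subject, attribute)` pair and compared by exact equality, whereas ComoRAG's own detection and resolution are not described at this level in the notes and are presumably model-driven.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy stateful workspace that flags contradictions instead of overwriting."""
    claims: dict = field(default_factory=dict)     # (subject, attribute) -> (value, source)
    conflicts: list = field(default_factory=list)  # (key, existing claim, incoming claim)

    def assert_claim(self, subject: str, attribute: str, value: str, source: str) -> bool:
        """Store a claim; return False and log a conflict if it contradicts a stored one."""
        key = (subject, attribute)
        existing = self.claims.get(key)
        if existing is not None and existing[0] != value:
            self.conflicts.append((key, existing, (value, source)))
            return False
        self.claims[key] = (value, source)
        return True

    def resolve(self, key: tuple, value: str, source: str) -> None:
        """Record the reconciled value and clear the conflicts logged for this key."""
        self.claims[key] = (value, source)
        self.conflicts = [c for c in self.conflicts if c[0] != key]


ws = Workspace()
ws.assert_claim("letter", "author", "Anna", source="retrieval cycle 1")
accepted = ws.assert_claim("letter", "author", "Boris", source="retrieval cycle 2")
print(accepted, len(ws.conflicts))

# A later cycle gathers more evidence and settles the question.
ws.resolve(("letter", "author"), "Boris", source="retrieval cycle 3")
print(ws.claims[("letter", "author")], len(ws.conflicts))
```

A stateless pipeline would have no record of the first claim by the time the second arrived, so the disagreement would never surface as something to investigate.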

The same corpus also carries a contrarian voice. Recursive subtask trees with KV-cache pruning sustain reasoning even while manipulating 90% of the cache (Can recursive subtask trees overcome context window limits?), and Atom of Thoughts goes fully *memoryless*: each state depends only on the current contracted problem, deliberately shedding accumulated history as bloat (Can reasoning systems forget history without losing coherence?). The tension is the real takeaway: the field hasn't agreed that persisting inferences is even desirable. Structured, selective, reconciling memory wins where continuity matters; aggressive forgetting wins where history is mostly baggage. Which you want depends on whether tomorrow's session needs to *remember* or merely to *start clean*.
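The memoryless alternative can be made concrete with a toy solver in which every state is a smaller but answer-equivalent problem, and the next state is computed from the current one alone. The nested-tuple arithmetic format and the `contract` and `solve` helpers are hypothetical stand-ins chosen for illustration; Atom of Thoughts itself decomposes questions into DAGs rather than arithmetic trees.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def contract(expr):
    """Return a smaller, answer-equivalent problem by solving one ready sub-problem."""
    op, left, right = expr
    if isinstance(left, tuple):
        return (op, contract(left), right)
    if isinstance(right, tuple):
        return (op, left, contract(right))
    return OPS[op](left, right)  # both operands are numbers: the problem collapses


def solve(expr):
    """Contract until only a number remains; no trace of earlier states is kept."""
    state, steps = expr, 0
    while isinstance(state, tuple):
        state = contract(state)  # depends only on the current state
        steps += 1
    return state, steps


answer, steps = solve(("+", ("*", 2, 3), ("-", 10, 4)))
print(answer, steps)  # 12 3
```

Nothing here could be persisted across sessions even in principle, because each intermediate state replaces the one before it. That is the trade: no history to reconcile, and no inherited conclusions either.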

Sources 7 notes

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can reasoning systems maintain memory across retrieval cycles?

ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether persistent memory architectures for precomputed inference remain bottlenecked or have shifted regimes. The question: which memory designs best store digested reasoning across sessions?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable:
• The real bottleneck may be *compute* to fold evicted context into fast weights during offline consolidation, not storage capacity itself (~2025).
• Selective routing (Titans: routing surprising tokens to neural memory, not all tokens) sustains 2M+ tokens without quadratic attention cost (~2024–12).
• Structured memory (DeepAgent's episodic/working/tool split; AgentFly's weight-free learning via memory reads/writes alone) cuts token overhead and enables 87.88% GAIA performance across sessions without retraining (~2025–10, ~2025–12).
• Stateful workspaces (ComoRAG) that reconcile contradictions between stored inferences outperform stateless retrieval by up to 11% (~2025–08).
• Aggressive memorylessness (Atom of Thoughts: each state depends only on the contracted problem) challenges the assumption that persisting inferences is desirable (~2025–02); separately, recursive subtask trees with KV-cache pruning sustain reasoning while manipulating 90% of the cache.

Anchor papers (verify; mind their dates):
• arXiv:2501.00663 (Titans, 2024–12): selective neural memory routing
• arXiv:2508.10419 (ComoRAG, 2025–08): stateful narrative reasoning with contradiction detection
• arXiv:2510.21618 (DeepAgent, 2025–10): structured episodic memory folding
• arXiv:2512.24601 (Atom of Thoughts, 2025–02): memoryless Markov-style reasoning

Your task:
(1) RE-TEST EACH CONSTRAINT. For selectivity (Titans), consolidation compute, and structured folding (DeepAgent), determine whether *recent* advances in KV-cache compression, sparse attention, or learned routing have shifted the bottleneck from *where* to store inferences to *whether storing inferences at all improves downstream task performance*. Separately: has the reconciliation requirement (ComoRAG) become table-stakes, or do newer agents succeed without explicit contradiction detection?
(2) Surface the strongest work from the last ~6 months that *contradicts* the structured-memory consensus—especially any paper showing weight-free memory adaptation fails at scale, or that simpler caching (no selectivity) now matches selective routing.
(3) Propose two research questions: (a) Does continuous online updating of stored inferences (vs. write-once-per-session) help or degrade agentic continuity? (b) At what session-interaction count does the compute cost of consolidation exceed the value of memory retrieval?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
