SYNTHESIS NOTE
Model Architecture and Internals · Agentic Systems and Tool Use · Conversational AI and Personalization

Can agents learn preferences by watching rather than asking?

Explores whether multimodal agents can build accurate preference models through continuous observation of user behavior, without explicit instruction, by organizing memory around entities and separating concrete events from derived knowledge.

Synthesis note · 2026-04-18 · sourced from Memory

M3-Agent (2508.09736) proposes a multimodal agent framework where long-term memory is organized as an entity-centric graph, with two types of memory generated from continuous video-stream perception:

Episodic memory records concrete events: "Alice takes the coffee and says, 'I can't go without this in the morning.'" Semantic memory derives general knowledge: "Alice prefers to drink coffee in the morning." Information about the same entity (face, voice, textual knowledge) is linked within the graph, and these links are established incrementally as the agent extracts and integrates semantic memory.
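As a rough illustration of the structure described above, the following is a minimal sketch of an entity-centric store that keeps episodic and semantic entries side by side on one node per entity. All class and field names (`EntityNode`, `MemoryGraph`, `face_ids`, `voice_ids`) are hypothetical and chosen for this note; they are not taken from the M3-Agent paper or code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntityNode:
    """One node per entity. Field names are illustrative assumptions."""
    entity_id: str
    face_ids: set[str] = field(default_factory=set)    # visual identifiers
    voice_ids: set[str] = field(default_factory=set)   # audio identifiers
    episodic: list[str] = field(default_factory=list)  # concrete events
    semantic: list[str] = field(default_factory=list)  # derived knowledge


class MemoryGraph:
    """Toy long-term memory keyed by entity (hypothetical API)."""

    def __init__(self) -> None:
        self.nodes: dict[str, EntityNode] = {}

    def node(self, entity_id: str) -> EntityNode:
        # Create the node on first mention, reuse it afterwards.
        return self.nodes.setdefault(entity_id, EntityNode(entity_id))

    def add_episodic(self, entity_id: str, event: str) -> None:
        self.node(entity_id).episodic.append(event)

    def add_semantic(self, entity_id: str, fact: str) -> None:
        node = self.node(entity_id)
        if fact not in node.semantic:  # keep derived knowledge deduplicated
            node.semantic.append(fact)


graph = MemoryGraph()
graph.node("alice").face_ids.add("face_17")
graph.add_episodic(
    "alice",
    "Alice takes the coffee and says, 'I can't go without this in the morning.'",
)
graph.add_semantic("alice", "Alice prefers to drink coffee in the morning.")
graph.add_semantic("alice", "Alice prefers to drink coffee in the morning.")
```

The design point this sketch tries to show is that the event and the generalization live on the same node, so a later query about Alice can reach either layer from one lookup.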

The architecture runs two parallel processes: (1) memorization, which continuously perceives real-time multimodal inputs to construct and update long-term memory; and (2) control, which interprets external instructions, reasons over stored memory, and executes tasks. This dual-process design means the agent can hand you coffee without asking "coffee or tea?" — it has already formed a memory of your preferences through observation.
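The two processes can be sketched as a background memorization thread and a control function that reads whatever memory exists at call time. This is a toy under stated assumptions: observations arrive as `(entity, item, context)` tuples, and a preference is "derived" by a simple count threshold (`min_count`), which stands in for whatever model-based extraction the actual system uses. Every name here is hypothetical.

```python
import queue
import threading
from collections import Counter, defaultdict

STOP = object()  # sentinel that ends the perception stream


class Memory:
    def __init__(self):
        self.lock = threading.Lock()
        self.episodic = defaultdict(list)  # entity -> [(item, context), ...]
        self.semantic = {}                 # (entity, context) -> preferred item


def memorize(stream, memory, min_count=2):
    """Memorization process: consume observations and update memory."""
    while True:
        obs = stream.get()
        if obs is STOP:
            break
        entity, item, context = obs
        with memory.lock:
            memory.episodic[entity].append((item, context))
            # Assumed rule: promote to semantic memory after repeated sightings.
            counts = Counter(i for i, c in memory.episodic[entity] if c == context)
            top_item, seen = counts.most_common(1)[0]
            if seen >= min_count:
                memory.semantic[(entity, context)] = top_item


def control(entity, context, memory):
    """Control process: act on stored preferences, ask only if none exist."""
    with memory.lock:
        preference = memory.semantic.get((entity, context))
    return f"serve {preference}" if preference else "ask the user"


memory = Memory()
stream = queue.Queue()
worker = threading.Thread(target=memorize, args=(stream, memory))
worker.start()

for obs in [("alice", "coffee", "morning"),
            ("alice", "tea", "evening"),
            ("alice", "coffee", "morning"),
            STOP]:
    stream.put(obs)
worker.join()  # wait here only so the example is deterministic

morning_action = control("alice", "morning", memory)
evening_action = control("alice", "evening", memory)
stranger_action = control("bob", "morning", memory)
```

In this run, `morning_action` is `"serve coffee"`, while the single evening observation and the unseen user both fall back to `"ask the user"`.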

The entity-centric graph structure is the key architectural choice. Unlike flat memory stores or conversation-history retrieval, entity-centric organization enables cross-modal association: a person's face links to their voice, which links to their preferences. This bears on the question raised in "Does abstract preference knowledge outperform specific interaction recall?", but M3-Agent keeps both the episodic and semantic layers and connects them through entity nodes rather than discarding one.
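One way to picture cross-modal association is an index that maps modality-specific identifiers to a shared entity, so knowledge attached through one modality is reachable through another. The sketch below is an assumption-laden toy: `EntityIndex`, the `(modality, identifier)` key format, and the merge-on-co-occurrence rule are all invented for illustration and are not described in the source.

```python
class EntityIndex:
    """Maps modality-specific identifiers to shared entities (illustrative)."""

    def __init__(self):
        self._entity_of = {}  # (modality, identifier) -> entity id
        self._facts = {}      # entity id -> list of semantic facts
        self._next_id = 0

    def link(self, *keys):
        """Attach co-occurring keys to one entity, merging entities if needed."""
        known = sorted({self._entity_of[k] for k in keys if k in self._entity_of})
        if known:
            entity = known[0]
            for other in known[1:]:
                # Re-point every key of the absorbed entity and move its facts.
                for k in [k for k, e in self._entity_of.items() if e == other]:
                    self._entity_of[k] = entity
                self._facts[entity].extend(self._facts.pop(other))
        else:
            entity = self._next_id
            self._next_id += 1
            self._facts[entity] = []
        for k in keys:
            self._entity_of[k] = entity
        return entity

    def add_fact(self, key, fact):
        # Creates the entity if this key has not been seen before.
        self._facts[self.link(key)].append(fact)

    def facts_for(self, key):
        entity = self._entity_of.get(key)
        return [] if entity is None else list(self._facts[entity])


index = EntityIndex()
# A face and a voice observed together are assumed to belong to one person.
index.link(("face", "face_17"), ("voice", "voice_04"))
index.add_fact(("voice", "voice_04"), "prefers coffee in the morning")
facts = index.facts_for(("face", "face_17"))
```

Here a fact recorded against the voice identifier is retrieved by the face identifier, which is the kind of lookup a flat, per-modality store would not support directly.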

The episodic/semantic distinction also echoes the hierarchical knowledge source described in "Can reasoning systems maintain memory across retrieval cycles?", where ComoRAG builds veridical, semantic, and episodic layers. M3-Agent, however, applies the distinction to continuous multimodal perception rather than text retrieval.

On the question posed in "How should agents decide what memories to keep?", M3-Agent's memorization process operates as continuous implicit memory: it is always running and always extracting, rather than waiting for explicit recognition of importance.

Original note title

multimodal agents require entity-centric memory graphs that separate episodic events from semantic knowledge — parallel memorization and control processes mirror human cognitive architecture