Can agents learn preferences by watching rather than asking?
Explores whether multimodal agents can build accurate preference models through continuous observation of user behavior, without explicit instruction, by organizing memory around entities and separating concrete events from derived knowledge.
M3-Agent (2508.09736) proposes a multimodal agent framework where long-term memory is organized as an entity-centric graph, with two types of memory generated from continuous video-stream perception:
Episodic memory records concrete events, for example: "Alice takes the coffee and says, 'I can't go without this in the morning.'" Semantic memory derives general knowledge from such events, for example: "Alice prefers to drink coffee in the morning." Information about the same entity (face, voice, textual knowledge) is connected in a graph, which is built up incrementally as the agent extracts and integrates semantic memory.
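To make the data structure concrete, here is a minimal sketch of an entity-centric store that keeps episodic and semantic memory apart while linking face and voice identifiers to the same node. All names (`EntityNode`, `EntityMemoryGraph`, `face_12`, `voice_7`) are hypothetical and chosen for illustration; this is not the M3-Agent implementation, and the semantic sentence is written by hand here, whereas in the paper it is derived by the agent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class EntityNode:
    """One entity, with its cross-modal identifiers and both memory types."""
    entity_id: str
    face_ids: Set[str] = field(default_factory=set)
    voice_ids: Set[str] = field(default_factory=set)
    episodic: List[str] = field(default_factory=list)  # concrete events
    semantic: List[str] = field(default_factory=list)  # derived knowledge


class EntityMemoryGraph:
    """Toy graph: entity nodes that face and voice identifiers point to."""

    def __init__(self) -> None:
        self.nodes: Dict[str, EntityNode] = {}

    def entity(self, entity_id: str) -> EntityNode:
        # Create the node on first mention, reuse it afterwards.
        if entity_id not in self.nodes:
            self.nodes[entity_id] = EntityNode(entity_id)
        return self.nodes[entity_id]

    def link(self, entity_id: str, *, face_id: Optional[str] = None,
             voice_id: Optional[str] = None) -> None:
        node = self.entity(entity_id)
        if face_id is not None:
            node.face_ids.add(face_id)
        if voice_id is not None:
            node.voice_ids.add(voice_id)

    def resolve(self, *, face_id: Optional[str] = None,
                voice_id: Optional[str] = None) -> Optional[EntityNode]:
        # Either modality is enough to reach the same entity node.
        for node in self.nodes.values():
            if face_id is not None and face_id in node.face_ids:
                return node
            if voice_id is not None and voice_id in node.voice_ids:
                return node
        return None


def build_demo_graph() -> EntityMemoryGraph:
    graph = EntityMemoryGraph()
    graph.link("alice", face_id="face_12", voice_id="voice_7")
    alice = graph.entity("alice")
    alice.episodic.append(
        "Alice takes the coffee and says, "
        "'I can't go without this in the morning.'"
    )
    # Hand-written stand-in for the knowledge the agent would derive.
    alice.semantic.append("Alice prefers to drink coffee in the morning.")
    return graph
```

Calling `build_demo_graph().resolve(voice_id="voice_7")` returns the same node that `face_12` points to, so a preference learned while the person was visible can be recalled when only their voice is heard.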
The architecture runs two parallel processes: (1) memorization, which continuously perceives real-time multimodal inputs to construct and update long-term memory; and (2) control, which interprets external instructions, reasons over stored memory, and executes tasks. This dual-process design means the agent can hand you coffee without asking "coffee or tea?" — it has already formed a memory of your preferences through observation.
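The two processes can be sketched as a background writer and a foreground reader sharing one memory. This is an illustrative toy under strong assumptions: observations arrive already annotated with an optional `inferred` fact (in the paper a model derives it), the control step is a keyword match rather than reasoning, and every name (`LongTermMemory`, `memorization_worker`, `control`, `run_demo`) is hypothetical.

```python
import queue
import threading
from typing import Dict, List, Optional, Tuple


class LongTermMemory:
    """Shared store written by memorization and read by control."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._episodic: Dict[str, List[str]] = {}
        self._semantic: Dict[str, List[str]] = {}

    def write(self, entity: str, event: str,
              inferred: Optional[str] = None) -> None:
        with self._lock:
            self._episodic.setdefault(entity, []).append(event)
            if inferred is not None:
                self._semantic.setdefault(entity, []).append(inferred)

    def semantic(self, entity: str) -> List[str]:
        with self._lock:
            return list(self._semantic.get(entity, []))


def memorization_worker(stream: "queue.Queue", memory: LongTermMemory) -> None:
    """Memorization: consume observations until a None sentinel arrives."""
    while True:
        obs = stream.get()
        if obs is None:
            break
        memory.write(obs["entity"], obs["event"], obs.get("inferred"))


def control(memory: LongTermMemory, entity: str,
            options: List[str]) -> Optional[str]:
    """Control: pick an option from stored knowledge, or None if the
    agent has nothing to go on and would have to ask."""
    for fact in memory.semantic(entity):
        for option in options:
            if option in fact.lower():
                return option
    return None


def run_demo() -> Tuple[Optional[str], Optional[str]]:
    memory = LongTermMemory()
    stream: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=memorization_worker,
                              args=(stream, memory))
    worker.start()
    stream.put({
        "entity": "Alice",
        "event": "Alice takes the coffee and says, "
                 "'I can't go without this in the morning.'",
        "inferred": "Alice prefers to drink coffee in the morning.",
    })
    stream.put(None)  # end of stream, for the demo only
    worker.join()     # wait so the demo result is deterministic
    drinks = ["coffee", "tea"]
    return control(memory, "Alice", drinks), control(memory, "Bob", drinks)
```

In this sketch `run_demo()` yields `"coffee"` for Alice and `None` for Bob, about whom nothing was observed. The `join()` exists only to make the demo deterministic; in a continuously running agent the writer would never stop and control would read whatever has been memorized so far.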
The entity-centric graph structure is the key architectural choice. Unlike flat memory stores or conversation-history retrieval, entity-centric organization enables cross-modal association: a person's face links to their voice, which links to their preferences. This connects to the question "Does abstract preference knowledge outperform specific interaction recall?", but M3-Agent captures both the episodic and the semantic layer and connects them through entity nodes rather than discarding one.
The episodic/semantic distinction also echoes the hierarchical knowledge source described in "Can reasoning systems maintain memory across retrieval cycles?", where ComoRAG builds veridical, semantic, and episodic layers. M3-Agent applies the same idea to continuous multimodal perception rather than text retrieval.
With respect to the question "How should agents decide what memories to keep?", M3-Agent's memorization process operates as continuous implicit memory: always running, always extracting, rather than waiting for explicit recognition of importance.
Inquiring lines that use this note as a source (43)
This note is a source for these synthesized inquiries.
- Does learning community preferences as training rewards operationalize prediction without participation?
- How should preference channels from historical sessions inform unified policy learning?
- How should historical preferences be weighted when users change their stated intent?
- Does sequential structure within sessions complement cross-session preference channels?
- How much task-relevant persona information is needed for accurate preference prediction?
- Can curiosity-driven dialogue incrementally discover user interest journeys in real time?
- Why do abstract semantic memories outperform specific interaction histories for journey discovery?
- Can agents learn user intent from unlabeled video without text labels?
- Can users articulate what they want before AI helps them discover it?
- How can a single policy handle both asking preferences and recommending items?
- Can curiosity-driven personalization work better than pre-conversation preference elicitation?
- How do implicit signals like clicks capture preference more reliably than explicit ratings?
- Can side information alone predict preferences without rating history?
- Why might text-only interfaces underestimate agent preference elicitation capabilities?
- Can curiosity rewards about user type complement general social motivation frameworks?
- Can subjective tasks be delegated without human feedback loops?
- What structural signals in user language reveal their unstated preferences and context?
- How should systems learn what each meeting participant actually cares about?
- Can users detect and correct an AI's mental model of their preferences?
- Does semantic memory improve AI personalization more than episodic memory?
- How can agents detect whether users are willing to follow their topic guidance?
- How can agents learn when silence is better than intervention?
- When should agents accommodate user preferences over their own goals?
- Can agents balance goal-driven proactivity with user preference alignment?
- How does active learning reduce queries needed for user preference inference?
- Why do agents fail to internalize value from informative observations?
- Can abstract preference summaries substitute for specific user interaction history?
- When does combining episodic and semantic memory reduce personalization performance?
- Can input-only training encode user preferences without task-specific labels?
- What distinguishes genuine user preferences from similar-user preferences in sparse data?
- Could AI agents scale the friend-with-different-preferences recommendation mechanism?
- How can insert-expansion techniques help users discover their own preferences?
- What multi-turn reward structures would encourage active intent discovery?
- Can multimodal agents use entity-centric graphs within this three-axis framework?
- What stops AI from helping users articulate preferences they cannot express?
- Can relationship dynamics between user and agent be tracked as distinct memory?
- How can agents learn user preferences during conversation without pre-calibration?
- How do entity graphs connect faces, voices, and preferences across modalities?
- Why does semantic memory abstraction outperform raw episodic recall for personalization?
- What triggers control processes to act on stored preference knowledge?
- Can multimodal architectures successfully integrate vision without replicating past failures?
- Can rich environment feedback replace human preference labels entirely?
- Why does continuous agent inference differ from human user inference?
Related concepts in this collection (2)
- Can three axes replace the short-term long-term memory split?
  Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.
  M3-Agent's episodic/semantic split is a specific instantiation along the *functions* axis; its parallel memorization and control processes are an instantiation of the *dynamics* axis (the formation operator runs continuously, the retrieval operator is goal-triggered).
- Can brain memory systems explain how LLMs should store knowledge?
  This explores whether the brain's three-tier memory architecture (neocortex, hippocampus, and prefrontal cortex) maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.
  M3-Agent's entity-centric graph functions as a hippocampal-style index that binds disparate elements (face, voice, knowledge) of an entity across modalities: the AI analog of how the hippocampus binds elements of an episode across cortical regions.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Preference Discerning with LLM-Enhanced Generative Retrieval
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
- PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
- PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
- On Generative Agents in Recommendation
- User-Centric Conversational Recommendation with Multi-Aspect User Modeling
- Large Multimodal Agents: A Survey
- Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Original note title
multimodal agents require entity-centric memory graphs that separate episodic events from semantic knowledge — parallel memorization and control processes mirror human cognitive architecture