Can multimodal agents use entity-centric graphs within this three-axis framework?
This note explores whether the entity-centric memory graphs that multimodal agents use to track people and objects fit naturally into the 'three-axis' view, which holds that reliable agents come from externalizing memory, skills, and protocols. The short version the corpus suggests: yes. Entity-centric graphs are essentially a worked example of one of those three axes, the memory axis, done well.
The three-axis idea comes from work arguing that reliable agents don't get their reliability from model scale alone. They get it by pushing three cognitive burdens out of the model and into a surrounding 'harness': memory (keeping state across time), skills (reusable procedures), and protocols (structured ways of interacting) (see: Where does agent reliability actually come from?). Read against that, a multimodal agent's entity-centric graph is exactly the memory burden externalized: instead of asking the model to re-derive who someone is on every turn, the agent stores a persistent graph node per entity and binds new observations to it.
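As a rough illustration of "a persistent node per entity, with new observations bound to it", here is a minimal sketch. All names (`EntityNode`, `EntityMemory`, `link`, `bind`, the cue keys) are hypothetical and chosen for this example; none of them come from the cited work, and real systems would resolve identity with learned face or voice embeddings rather than exact-match keys.

```python
from dataclasses import dataclass, field


@dataclass
class EntityNode:
    """One persistent node per tracked entity (hypothetical structure)."""
    entity_id: str
    observations: list = field(default_factory=list)


class EntityMemory:
    """Toy externalized memory: state lives here, not in the model's context."""

    def __init__(self):
        self.nodes = {}    # entity_id -> EntityNode
        self.aliases = {}  # (modality, key) -> entity_id

    def link(self, modality, key, entity_id):
        """Register that a cue (e.g. a face or voice key) refers to an entity."""
        self.nodes.setdefault(entity_id, EntityNode(entity_id))
        self.aliases[(modality, key)] = entity_id

    def bind(self, modality, key, observation):
        """Attach a new observation to the node its cue resolves to."""
        entity_id = self.aliases.get((modality, key))
        if entity_id is None:
            # Unknown cue: create a fresh node rather than guessing identity.
            entity_id = f"entity_{len(self.nodes)}"
            self.link(modality, key, entity_id)
        self.nodes[entity_id].observations.append(observation)
        return entity_id


memory = EntityMemory()
memory.link("face", "face_01", "alice")
memory.link("voice", "voice_07", "alice")
memory.bind("face", "face_01", "entered the kitchen")
memory.bind("voice", "voice_07", "asked for tea")
```

Both observations end up on the same `alice` node even though they arrive through different modalities, so the model never has to re-derive who produced them.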
The multimodal piece is where this gets interesting. M3-Agent shows that an entity-centric graph which separates episodic events ('what happened') from semantic knowledge ('what's true about this person') lets an agent learn your preferences by watching across video and audio, without ever asking (see: Can agents learn preferences by watching rather than asking?). That separation is itself a design choice about how memory is structured, which means the memory axis isn't a single slot but has its own internal architecture. MegaRAG pushes the same instinct in a different direction: it builds hierarchical multimodal knowledge graphs over books where images are first-class nodes, enabling 'global' reasoning across chapters that flat chunk retrieval can't reach (see: Can multimodal knowledge graphs answer questions that flat retrieval cannot?). Both treat the graph not as a lookup table but as the substrate that makes higher-order reasoning possible.
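The episodic/semantic split can be made concrete with a small sketch. This is not M3-Agent's implementation: the `EntityRecord` layout, the `consolidate` function, and its frequency threshold are all assumptions invented here to show how raw events and derived facts can live in separate stores.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    """Memory for one entity, split into two stores (illustrative layout)."""
    episodic: list = field(default_factory=list)  # what happened: (time, action, item)
    semantic: dict = field(default_factory=dict)  # what is true: attribute -> value


def record_event(record, timestamp, action, item):
    """Append an observed event to episodic memory without interpreting it."""
    record.episodic.append((timestamp, action, item))


def consolidate(record, action, attribute, min_count=2):
    """Toy rule: promote the most frequent item for an action to a semantic fact.

    The threshold and the counting rule are assumptions for illustration.
    """
    counts = Counter(item for _, act, item in record.episodic if act == action)
    if counts:
        item, n = counts.most_common(1)[0]
        if n >= min_count:
            record.semantic[attribute] = item


alice = EntityRecord()
record_event(alice, 1, "drinks", "green tea")
record_event(alice, 2, "drinks", "coffee")
record_event(alice, 3, "drinks", "green tea")
consolidate(alice, "drinks", "preferred_drink")
```

The episodic list keeps every observation intact, while the semantic dictionary holds only what has been inferred from them, so a preference can be acted on without the agent ever having asked for it.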
Here's the part you might not have expected: graphs aren't just passive storage in this framework. They can behave like a living system. Analysis of iterative graph reasoning shows it self-organizes toward a critical state where roughly 12% of edges stay 'semantically surprising' even after being structurally connected, and that residual surprise is what keeps the agent discovering new connections (see: Why do reasoning systems keep discovering new connections?). So an entity-centric graph can do double duty: it's the memory axis, but its structure also feeds the reasoning the skills axis depends on. And the same graph abstraction reaches the third axis too: language agents can be represented as optimizable computational graphs where nodes are operations and edges are information flow, which reads naturally as the protocol axis written as a graph (see: Can we automatically optimize both prompts and agent coordination?).
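To make "nodes are operations, edges are information flow" tangible, here is a minimal sketch of an agent expressed as a computational graph. The `AgentGraph` class and the node names are hypothetical, and plain string functions stand in for model calls; the only library used is Python's standard `graphlib` (3.9+). It does not reproduce the optimization procedure from the cited note.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class AgentGraph:
    """Toy agent-as-graph: nodes are operations, edges carry outputs forward."""

    def __init__(self):
        self.ops = {}     # node name -> callable
        self.inputs = {}  # node name -> list of upstream node names

    def add_node(self, name, op, inputs=()):
        self.ops[name] = op
        self.inputs[name] = list(inputs)

    def run(self, query):
        """Execute nodes in dependency order; source nodes receive the query."""
        results = {}
        for name in TopologicalSorter(self.inputs).static_order():
            upstream = [results[src] for src in self.inputs[name]]
            results[name] = self.ops[name](*(upstream or [query]))
        return results


# Plain string functions stand in for model calls (placeholders, not an LLM).
graph = AgentGraph()
graph.add_node("draft", lambda q: f"draft({q})")
graph.add_node("critique", lambda d: f"critique({d})", inputs=["draft"])
graph.add_node("revise", lambda d, c: f"revise({d}, {c})", inputs=["draft", "critique"])
outputs = graph.run("question")
```

Because the wiring is ordinary data (`inputs`) and each operation is a swappable callable, both the edges and the node behavior are things an optimizer could in principle edit, which is the property the graph view of coordination relies on.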
The payoff: entity-centric graphs don't just slot into one axis of the framework — graph structure turns out to be a shared language across all three. Memory is a graph of entities, reasoning is a graph that stays productively unsettled, and coordination is a graph of operations. A multimodal agent using entity-centric graphs isn't bolting a feature onto the three-axis model; it's expressing the model in its most natural form.
Sources (5 notes)
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.
MegaRAG builds hierarchical multimodal knowledge graphs from text and visuals to answer cross-chapter, global questions that flat chunk retrieval cannot reach. The hierarchy supports abstraction levels from high-level summaries to page-specific details while treating images as first-class graph nodes.
Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.
Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.