How should visual content be connected to text within a unified knowledge representation?

This note explores how images and text can be joined inside a single knowledge structure, connected so a system can reason across both rather than searched in parallel silos. The corpus surfaces three competing strategies: make images first-class graph nodes, translate images into text so they share one index, or keep modalities separate but link them through structure. The interesting part is that they disagree about *where* the connection should happen.

The most direct approach treats images as full citizens of a graph. MegaRAG builds hierarchical multimodal knowledge graphs over books where visuals are first-class nodes sitting alongside text, linked across abstraction levels from chapter summaries down to page details. That lets it answer cross-chapter, global questions that flat chunk retrieval simply can't reach (see "Can multimodal knowledge graphs answer questions that flat retrieval cannot?"). Here the visual content keeps its own identity but is wired into the same structure as the text, so reasoning can hop between a diagram and the paragraph that explains it.
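To make "images as first-class nodes" concrete, here is a minimal sketch of such a graph. It is not MegaRAG's implementation: the `Node` and `MultimodalGraph` classes, the relation names (`contains`, `explains`), and the sample content are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    modality: str  # "text" or "image"
    level: str     # abstraction level, e.g. "chapter" or "page"
    content: str   # text, or a file path standing in for an image


class MultimodalGraph:
    """Hypothetical graph where text and image nodes share one structure."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # node_id -> [(relation, node_id)]

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        # Store both directions so traversal can hop either way.
        self.edges[src].append((relation, dst))
        self.edges[dst].append((f"inverse:{relation}", src))

    def neighbors(self, node_id, modality=None):
        """Return (relation, node) pairs, optionally filtered by modality."""
        out = []
        for relation, other in self.edges[node_id]:
            if modality is None or self.nodes[other].modality == modality:
                out.append((relation, self.nodes[other]))
        return out

    def reachable(self, start, max_hops):
        """Breadth-first search: ids of nodes within max_hops of start."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            current, depth = queue.popleft()
            if depth == max_hops:
                continue
            for _, other in self.edges[current]:
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
        return seen


# Made-up sample content: one chapter summary, one paragraph, one figure.
graph = MultimodalGraph()
graph.add_node(Node("ch1", "text", "chapter", "Summary of chapter 1"))
graph.add_node(Node("p12", "text", "page", "Paragraph explaining the pump cycle"))
graph.add_node(Node("fig3", "image", "page", "figures/pump_cycle.png"))
graph.add_edge("ch1", "contains", "p12")
graph.add_edge("ch1", "contains", "fig3")
graph.add_edge("p12", "explains", "fig3")

# Hop from the figure to the text nodes connected to it.
explaining = graph.neighbors("fig3", modality="text")
```

Because the figure is a node rather than an attachment on a text chunk, the same traversal code reaches it from the chapter summary or from the paragraph, which is the property the hierarchical design relies on.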

The opposite strategy dissolves the visual into text entirely. SignRAG describes an unknown image with a vision-language model and then retrieves matches from a *text*-indexed database, and finds that a natural-language description bridges the visual-reference gap better than direct embedding similarity (see "Can describing images in text improve zero-shot recognition?"). In this view the cleanest unified representation is just text: convert everything to language and you get one index, one retrieval mechanism, no cross-modal plumbing. But there's a cost the corpus flags loudly: text is a lossy abstraction that strips out the physics, geometry, and causality present in the original, producing predictable failure modes in exactly the reasoning that images were carrying (see "Are text-only language models fundamentally limited by abstraction?"). So 'just describe it in words' buys simplicity by throwing away what made the image worth keeping.
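The describe-then-retrieve pattern can be sketched in a few lines. This is an illustration under stated assumptions, not SignRAG's pipeline: `describe_image` is a hypothetical stand-in for a vision-language model call (it returns a canned string so the sketch runs), the index entries are made up, and token-overlap (Jaccard) scoring is swapped in for whatever text retriever a real system would use.

```python
import re


def describe_image(image_path):
    """Hypothetical stand-in for a vision-language model call.

    A real system would send the image to a VLM; a canned lookup keeps
    this sketch runnable without any model.
    """
    canned = {
        "unknown_sign.png": "red octagon sign with white letters",
    }
    return canned[image_path]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(description, text_index, top_k=1):
    """Rank index entries by token overlap with the description."""
    query = tokenize(description)
    scored = [
        (jaccard(query, tokenize(entry_text)), label)
        for label, entry_text in text_index.items()
    ]
    scored.sort(reverse=True)
    return [label for _, label in scored[:top_k]]


# Made-up text-only index of reference designs.
text_index = {
    "stop": "red octagon with white border and white letters",
    "yield": "white downward triangle with red border",
    "speed_limit": "white rectangle with black numbers",
}

description = describe_image("unknown_sign.png")
best_match = retrieve(description, text_index)
```

Note what the index never sees: pixel geometry, scale, or spatial layout beyond what the description happened to mention. Anything the describer leaves out is unrecoverable at retrieval time, which is the lossy-abstraction cost in miniature.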

A third camp keeps modalities separate but couples them through layered structure. Agent S pairs raw visual input (for environmental understanding) with image-augmented accessibility trees (for grounding), deliberately factoring planning and grounding into separate optimization paths instead of forcing one end-to-end blob (see "Can structured interfaces help language models control GUIs better?"). CoCoT makes a similar bet inside a single model, scaffolding visual reasoning through staged steps and finding that cognitive *structure* matters more than reasoning volume (see "Can breaking down visual reasoning into three stages improve model performance?"). The lesson that travels across both: the visual-to-text link works better when the architecture names the relationship explicitly rather than hoping a shared embedding space figures it out.

The less obvious point is that this isn't really a multimodal question. It is the same structure-versus-similarity argument the graph-retrieval papers are having about pure text. SymAgent shows that symbolic rules derived from graph topology beat plain semantic similarity because they capture structural patterns explicitly (see "Can symbolic rules from knowledge graphs guide complex reasoning?"), and AffordanceRAG shows that visual similarity alone misleads a robot until you rerank by what's physically executable (see "Can visual similarity alone guide robot object retrieval?"). Both echo the multimodal finding: connecting visual content to text isn't about finding a single vector where a picture and a sentence land near each other. It is about encoding the *relationship* between them (containment, grounding, affordance, role) as explicit structure. The unified representation that works is the one where the edges, not just the nodes, carry meaning.
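The similarity-versus-structure contrast can be shown with a toy reranker. This is a sketch under assumptions, not AffordanceRAG's scoring: the `Candidate` fields, the `min_affordance` threshold, and all the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    visual_similarity: float  # from an embedding search, in [0, 1]
    affordance: float         # estimated chance the action is executable, in [0, 1]


def rerank(candidates, min_affordance=0.5):
    """Drop candidates unlikely to be executable, then sort the rest by
    affordance first and visual similarity second (both descending)."""
    feasible = [c for c in candidates if c.affordance >= min_affordance]
    return sorted(
        feasible,
        key=lambda c: (c.affordance, c.visual_similarity),
        reverse=True,
    )


# Made-up candidates: the closest visual match is out of reach.
candidates = [
    Candidate("mug on high shelf", visual_similarity=0.93, affordance=0.10),
    Candidate("mug on counter", visual_similarity=0.88, affordance=0.90),
    Candidate("bowl on counter", visual_similarity=0.61, affordance=0.85),
]

by_similarity = max(candidates, key=lambda c: c.visual_similarity)
by_affordance = rerank(candidates)
```

Ranking by similarity alone picks the unreachable mug, while the reranked list starts with the one on the counter. The affordance score plays the role of an explicit relationship between object and task that the embedding distance does not carry.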

Sources (7 notes)

Can multimodal knowledge graphs answer questions that flat retrieval cannot?

MegaRAG builds hierarchical multimodal knowledge graphs from text and visuals to answer cross-chapter, global questions that flat chunk retrieval cannot reach. The hierarchy supports abstraction levels from high-level summaries to page-specific details while treating images as first-class graph nodes.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can structured interfaces help language models control GUIs better?

Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.

Can breaking down visual reasoning into three stages improve model performance?

CoCoT structures VLM reasoning through embodied perception, embedded situation analysis, and norm-grounded interpretation, achieving a +8% improvement over flat CoT on social benchmarks. The gains suggest cognitive structure matters more than reasoning volume for social tasks.

Can symbolic rules from knowledge graphs guide complex reasoning?

SymAgent derives symbolic rules from KG structure using LLM reasoning to create navigational plans that align natural language with graph topology. This approach captures structural reasoning patterns explicitly, outperforming retrieval methods that rely on semantic similarity alone.

Can visual similarity alone guide robot object retrieval?

AffordanceRAG reranks visually retrieved objects by affordance scores, ensuring the robot selects only physically executable actions. This architectural shift from similarity to task-grounded ranking prevents plans that fail at execution time.
