SYNTHESIS NOTE

Can visual similarity alone guide robot object retrieval?

Visual retrieval works for text QA but fails for embodied agents—the most visually similar object may be unreachable or locked. Should retrieval systems for robots rank by what the agent can physically execute instead?

Synthesis note · 2026-05-03

Standard multimodal retrieval ranks candidates by visual or semantic similarity to a query — useful for question answering but disastrous for embodied agents, because the most visually similar object may be unreachable, immovable, or behind a closed door. AffordanceRAG adds an affordance reranking step on top of visual retrieval: it builds an affordance-aware memory from images of the explored environment, retrieves objects and locations by visual and regional features, and then reranks them by whether the robot can physically execute an action on them.
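The retrieve-then-rerank flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AffordanceRAG's implementation: the `MemoryEntry` schema, the precomputed `embedding` vectors, the scalar `affordance` score in [0, 1], and the rule of sorting by affordance first with similarity as a tie-breaker are all hypothetical stand-ins for whatever the system actually uses.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MemoryEntry:
    """One object or location seen during exploration (hypothetical schema)."""
    name: str
    embedding: list      # visual feature vector, assumed precomputed
    affordance: float    # assumed executability score in [0, 1]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, memory, k=3):
    """Stage 1: top-k (similarity, entry) pairs by visual similarity alone."""
    scored = [(cosine(query, e.embedding), e) for e in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rerank_by_affordance(candidates):
    """Stage 2: reorder retrieved pairs by affordance, ties broken by similarity."""
    return sorted(
        candidates,
        key=lambda pair: (pair[1].affordance, pair[0]),
        reverse=True,
    )


# Toy memory: three mugs that look alike but differ in reachability.
memory = [
    MemoryEntry("mug_on_high_shelf", [1.0, 0.0], 0.1),
    MemoryEntry("mug_in_locked_cabinet", [0.95, 0.05], 0.0),
    MemoryEntry("mug_on_table", [0.9, 0.1], 0.9),
    MemoryEntry("plant", [0.0, 1.0], 0.8),
]
query = [1.0, 0.0]  # assumed embedding of the request "bring me the mug"

visual_only = retrieve(query, memory)
reranked = rerank_by_affordance(visual_only)

print("visual top-1:  ", visual_only[0][1].name)
print("reranked top-1:", reranked[0][1].name)
```

In this toy memory, visual similarity alone puts the mug on the high shelf first, while the reranked list leads with the mug on the table. The plant never enters the candidate set despite its high affordance score, because reranking only reorders what visual retrieval has already returned.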

The conceptual move is treating affordance — what the agent can do with the object — as a first-class retrieval signal rather than a downstream filter. This matters because the failure modes of visually-similar-but-unactionable retrieval are not easily corrected at action time: by the time the planner discovers the cabinet is locked or the cup is too high, the system has already committed to a plan around it. Reranking by affordance during retrieval prunes these dead ends before they become plans.

More broadly, the work argues that RAG for embodied agents needs a different similarity function from RAG for text. The grounding criterion is not "this passage answers the question" but "this object permits the action." Carrying that distinction into the retrieval architecture, rather than treating it as a post-hoc check, is what makes zero-shot mobile manipulation tractable without task-specific training. The general pattern of replacing similarity-based ranking with task-aware ranking also surfaces in two related notes: "Can rationale-driven selection beat similarity re-ranking for evidence?" (where rationale replaces semantic similarity) and "Can interleaving reasoning with real-world feedback prevent hallucination?" (where action feedback corrects model-internal associations).

Original note title

affordance-aware retrieval reranks robot perception by physical executability — visual similarity alone retrieves objects the robot cannot actually act on