Can visual similarity alone guide robot object retrieval?

Similarity-based retrieval works for question answering but fails for embodied agents: the most visually similar object may be unreachable or locked. Should retrieval systems for robots instead rank by what the agent can physically execute?

Synthesis note · 2026-05-03

Standard multimodal retrieval ranks candidates by visual or semantic similarity to a query. That is useful for question answering but can fail badly for embodied agents, because the most visually similar object may be unreachable, immovable, or behind a closed door. AffordanceRAG adds an affordance reranking step on top of visual retrieval: it builds an affordance-aware memory from images of the explored environment, retrieves objects and locations by visual and regional features, and then reranks them by whether the robot can physically execute an action on them.
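
The retrieve-then-rerank pattern described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not AffordanceRAG's implementation: the `MemoryEntry` fields, the three boolean checks, and the `affordance_score` formula are all hypothetical stand-ins, and the three-dimensional vectors stand in for real visual features.

```python
from dataclasses import dataclass
import math


@dataclass
class MemoryEntry:
    """One object seen during exploration (all fields are illustrative)."""
    name: str
    embedding: list       # stand-in for a visual/regional feature vector
    reachable: bool       # hypothetical check: within the arm's workspace
    movable: bool         # hypothetical check: not fixed in place
    accessible: bool      # hypothetical check: not behind a closed door


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def affordance_score(entry):
    """Toy executability score: fraction of the three checks that pass."""
    return (entry.reachable + entry.movable + entry.accessible) / 3.0


def retrieve(query, memory, k):
    """Stage 1: top-k candidates by visual similarity only."""
    ranked = sorted(memory, key=lambda e: cosine(query, e.embedding), reverse=True)
    return ranked[:k]


def rerank(query, candidates):
    """Stage 2: order by affordance, using visual similarity as a tie-breaker."""
    return sorted(
        candidates,
        key=lambda e: (affordance_score(e), cosine(query, e.embedding)),
        reverse=True,
    )


memory = [
    MemoryEntry("mug_on_high_shelf", [0.9, 0.1, 0.0], False, True, True),
    MemoryEntry("mug_in_locked_cabinet", [0.85, 0.15, 0.0], True, True, False),
    MemoryEntry("mug_on_table", [0.7, 0.3, 0.0], True, True, True),
    MemoryEntry("lamp", [0.0, 0.2, 0.9], True, True, True),
]
query = [1.0, 0.0, 0.0]  # stand-in for the embedding of "a mug"

by_similarity = retrieve(query, memory, k=3)
by_affordance = rerank(query, by_similarity)

print([e.name for e in by_similarity])
print([e.name for e in by_affordance])
```

In this toy memory, similarity alone puts the mug on the high shelf first, while the rerank promotes the mug on the table because it is the only candidate that passes every check. A real system would replace the boolean flags with estimates derived from perception and the robot's kinematics.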

The conceptual move is treating affordance — what the agent can do with the object — as a first-class retrieval signal rather than a downstream filter. This matters because the failure modes of visually-similar-but-unactionable retrieval are not easily corrected at action time: by the time the planner discovers the cabinet is locked or the cup is too high, the system has already committed to a plan around it. Reranking by affordance during retrieval prunes these dead ends before they become plans.

More broadly, the work argues that RAG for embodied agents needs a different similarity function from RAG for text. The grounding criterion is not "this passage answers the question" but "this object permits the action." Carrying that distinction into the retrieval architecture, rather than treating it as a post-hoc check, is what makes zero-shot mobile manipulation tractable without task-specific training. The general pattern of replacing similarity-based ranking with task-aware ranking also surfaces in "Can rationale-driven selection beat similarity re-ranking for evidence?" (where rationale replaces semantic similarity) and in "Can interleaving reasoning with real-world feedback prevent hallucination?" (where action feedback corrects model-internal associations).

Related concepts in this collection 3

Can rationale-driven selection beat similarity re-ranking for evidence? Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.
extends: same architectural move of replacing similarity scoring with task-grounded scoring (rationale for QA, affordance for embodied action); both keep retrieval but install a different ranking criterion
Can interleaving reasoning with real-world feedback prevent hallucination? Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.
extends: both use real-world executability rather than model-internal representations to constrain output; AffordanceRAG does this at retrieval time, while ReAct does it at reasoning time
Do embedding dimensions fundamentally limit retrievable document combinations? Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.
supports: motivates why visual-similarity retrieval alone fails in embodied settings — embedding similarity cannot encode action constraints

Original note title

affordance-aware retrieval reranks robot perception by physical executability — visual similarity alone retrieves objects the robot cannot actually act on
