Can visual similarity alone guide robot object retrieval?
Similarity-based retrieval works for question answering but fails for embodied agents: the most visually similar object may be unreachable or locked. Should retrieval systems for robots instead rank by what the agent can physically execute?
Standard multimodal retrieval ranks candidates by visual or semantic similarity to a query — useful for question answering but disastrous for embodied agents, because the most visually similar object may be unreachable, immovable, or behind a closed door. AffordanceRAG adds an affordance reranking step on top of visual retrieval: it builds an affordance-aware memory from images of the explored environment, retrieves objects and locations by visual and regional features, and then reranks them by whether the robot can physically execute an action on them.
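The retrieve-then-rerank flow described above can be sketched in a few lines. This is a minimal illustration and not AffordanceRAG's implementation: the `Candidate` fields, the precomputed `affordance` score in [0, 1], and the linear blend controlled by `weight` are all assumptions made for the example, and the note does not say how the system actually combines its scores.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    name: str
    embedding: Sequence[float]  # hypothetical visual feature vector
    affordance: float           # hypothetical executability score in [0, 1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: Sequence[float], memory: List[Candidate], k: int) -> List[Candidate]:
    """Stage 1: rank the memory by visual similarity only and keep the top k."""
    return sorted(memory, key=lambda c: cosine(query, c.embedding), reverse=True)[:k]


def rerank_by_affordance(query: Sequence[float], candidates: List[Candidate],
                         weight: float = 0.5) -> List[Candidate]:
    """Stage 2: reorder candidates by a blend of similarity and affordance.

    The linear blend is an assumption chosen for illustration.
    """
    def score(c: Candidate) -> float:
        return (1.0 - weight) * cosine(query, c.embedding) + weight * c.affordance
    return sorted(candidates, key=score, reverse=True)


# Toy memory with made-up embeddings and affordance scores.
memory = [
    Candidate("mug_on_high_shelf", [1.0, 0.0], 0.1),
    Candidate("mug_on_table", [0.9, 0.1], 0.9),
    Candidate("plate_in_locked_cabinet", [0.2, 0.8], 0.0),
]
query = [1.0, 0.0]

top_k = retrieve(query, memory, k=2)
reranked = rerank_by_affordance(query, top_k)
print([c.name for c in top_k])
print([c.name for c in reranked])
```

In this toy memory the mug on the high shelf is the closest visual match, but after reranking the reachable mug on the table moves to the top of the list.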
The conceptual move is treating affordance — what the agent can do with the object — as a first-class retrieval signal rather than a downstream filter. This matters because the failure modes of visually-similar-but-unactionable retrieval are not easily corrected at action time: by the time the planner discovers the cabinet is locked or the cup is too high, the system has already committed to a plan around it. Reranking by affordance during retrieval prunes these dead ends before they become plans.
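A toy contrast may make the difference between a downstream filter and a first-class signal concrete. Everything here is hypothetical: the candidate triples, the threshold, and the product score are invented for illustration and are not taken from AffordanceRAG.

```python
# Hypothetical (name, visual_similarity, affordance) triples; the numbers are made up.
pool = [
    ("cup_top_shelf", 0.98, 0.05),
    ("cup_locked_cabinet", 0.95, 0.00),
    ("cup_on_counter", 0.80, 0.90),
]


def filter_after_retrieval(pool, k, min_affordance):
    """Similarity alone picks the top k; affordance is only checked afterwards."""
    top_k = sorted(pool, key=lambda c: c[1], reverse=True)[:k]
    return [c for c in top_k if c[2] >= min_affordance]


def rank_with_affordance(pool, k):
    """Affordance enters the ranking score itself (a simple product, by assumption)."""
    return sorted(pool, key=lambda c: c[1] * c[2], reverse=True)[:k]


late = filter_after_retrieval(pool, k=2, min_affordance=0.5)
early = rank_with_affordance(pool, k=2)
print(late)
print(early)
```

With the late filter, both retrieved cups fail the affordance check and nothing actionable is left, which is the dead end the paragraph describes. When affordance is part of the ranking, the cup on the counter is ranked first.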
More broadly, the work argues that RAG for embodied agents needs a different similarity function from RAG for text. The grounding criterion is not "this passage answers the question" but "this object permits the action." Carrying that distinction into the retrieval architecture, rather than treating it as a post-hoc check, is what makes zero-shot mobile manipulation tractable without task-specific training. The general pattern of replacing similarity-based ranking with task-aware ranking also surfaces in two related notes: "Can rationale-driven selection beat similarity re-ranking for evidence?" (where rationale replaces semantic similarity) and "Can interleaving reasoning with real-world feedback prevent hallucination?" (where action feedback corrects model-internal associations).
Inquiring lines that use this note as a source (8)
This note is a source for these synthesized inquiries.
- What role does visual perception play alongside accessibility tree information?
- Why does visual similarity retrieval fail for embodied agents?
- How can affordance become a primary retrieval signal instead of a filter?
- Why do vector embeddings fail for sequential procedural retrieval tasks?
- How should visual content be connected to text within a unified knowledge representation?
- Why does text-mediated retrieval avoid the embedding dimension limits of visual similarity?
- Can small transformers trained on similarity maps replace dense retrievers entirely?
- Can multimodal architectures successfully integrate vision without replicating past failures?
Related concepts in this collection (3)
- Can rationale-driven selection beat similarity re-ranking for evidence?
  Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.
  extends: same architectural move of replacing similarity scoring with task-grounded scoring (rationale for QA, affordance for embodied action); both keep retrieval but install a different ranking criterion
- Can interleaving reasoning with real-world feedback prevent hallucination?
  Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.
  extends: both use real-world executability rather than model-internal representations to constrain output; AffordanceRAG does this at retrieval time, ReAct does it at reasoning time
- Do embedding dimensions fundamentally limit retrievable document combinations?
  Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.
  supports: motivates why visual-similarity retrieval alone fails in embodied settings, since embedding similarity cannot encode action constraints
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- On the Theoretical Limitations of Embedding-Based Retrieval
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
- Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
- ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
Original note title
affordance-aware retrieval reranks robot perception by physical executability — visual similarity alone retrieves objects the robot cannot actually act on