Why do queries and documents occupy different embedding spaces?
Queries and documents express the same information in fundamentally different ways: short and interrogative versus long and declarative. This mismatch explains why direct embedding retrieval often fails.
The standard embedding retrieval pipeline maps a query directly to a vector and finds nearby document vectors. This assumes that a query and a relevant document occupy nearby regions of the embedding space. They often do not. Queries are short, telegraphic, and interrogative. Relevant documents are long, detailed, and declarative. The same information expressed in query form and document form looks different to an encoder trained on natural language co-occurrence.
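The direct pipeline described above can be sketched as follows. This is a minimal illustration, not a real retrieval system: the bag-of-words `embed` function is a hypothetical stand-in for a learned dense encoder, and the two-document corpus and all function names are assumptions made for the example.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    # Map every distinct token to a vector index.
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    # ASSUMPTION: toy encoder (L2-normalised word counts), standing in
    # for a trained dense encoder.
    vec = np.zeros(len(vocab))
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query_text, docs, vocab, k=2):
    # Embed the query directly and rank documents by cosine similarity.
    doc_matrix = np.stack([embed(d, vocab) for d in docs])
    scores = doc_matrix @ embed(query_text, vocab)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


docs = [
    # Index 0: declarative passage that actually answers the question.
    "Rayleigh scattering causes shorter wavelengths of sunlight to scatter "
    "more strongly in the atmosphere, which gives the daytime sky its blue color.",
    # Index 1: distractor that merely resembles the query's surface form.
    "Why is my sky blue paint peeling? Ask the hardware store.",
]
query = "why is the sky blue"

vocab = build_vocab(docs + [query])
ranking = retrieve(query, docs, vocab)
print(ranking)
```

In this toy setup the distractor (index 1) outranks the passage that explains the phenomenon (index 0), because the query's surface form resembles another question more than it resembles a declarative answer. A learned encoder is far less literal than word counts, so the sketch shows only the shape of the failure.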
HyDE (Hypothetical Document Embeddings) decomposes retrieval into two steps that exploit this asymmetry. First, an instruction-following LLM generates a hypothetical document that would answer the query: not a real document, but something that looks like one. Second, the hypothetical document is embedded, and document-document similarity is used to find real corpus matches. The encoder, trained to place similar documents near one another, now operates in its natural space.
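The two steps can be sketched as below. This is a hedged illustration under stated assumptions: `generate_hypothetical_document` is a hypothetical stub that returns canned text where a real system would prompt an instruction-following LLM, and the bag-of-words `embed` function stands in for a contrastive document encoder. The corpus and all names are invented for the example.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    # ASSUMPTION: toy encoder (L2-normalised word counts), not a real model.
    vec = np.zeros(len(vocab))
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def generate_hypothetical_document(query_text):
    # HYPOTHETICAL STUB: a real system would call an instruction-following
    # LLM with a prompt such as "Write a passage that answers: <query>".
    # The returned text may be factually wrong; only its form matters here.
    return (
        "The sky appears blue because of Rayleigh scattering: molecules in "
        "the atmosphere scatter shorter blue wavelengths of sunlight more "
        "strongly than longer red wavelengths."
    )


def hyde_retrieve(query_text, docs, k=2):
    # Step 1: turn the query into a document-shaped text.
    hypothetical = generate_hypothetical_document(query_text)
    vocab = build_vocab(docs + [hypothetical])
    # Step 2: embed the hypothetical document and rank real documents
    # by document-document cosine similarity.
    doc_matrix = np.stack([embed(d, vocab) for d in docs])
    scores = doc_matrix @ embed(hypothetical, vocab)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


docs = [
    # Index 0: declarative passage that actually answers the question.
    "Rayleigh scattering causes shorter wavelengths of sunlight to scatter "
    "more strongly in the atmosphere, which gives the daytime sky its blue color.",
    # Index 1: distractor that resembles the query's surface form.
    "Why is my sky blue paint peeling? Ask the hardware store.",
]

hyde_ranking = hyde_retrieve("why is the sky blue", docs)
print(hyde_ranking)
```

With the hypothetical passage as the search key, the explanatory document (index 0) ranks first under this toy encoder, because the generated text shares the declarative vocabulary and style of the real answer.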
The generated document may be factually wrong; it is, in the FLARE framing, a hallucination on purpose. But the goal is not factual accuracy. The goal is the relevance pattern. The hypothetical document "captures relevance by example": it demonstrates what a relevant document looks like in terms of style, terminology, and structure. The encoder's dense bottleneck filters out hallucinated details while preserving the embedding signature of relevant content.
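One way to picture the filtering effect: the HyDE paper samples several hypothetical documents and averages their embeddings (together with the query's embedding). The sketch below illustrates only the averaging idea, under a loud assumption: each sampled embedding is modelled as a shared "relevance" direction plus independent random noise standing in for hallucinated details. The vectors are synthetic, and the dimensions, noise scale, and names are all invented for the example.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim, n_samples = 64, 8

# ASSUMPTION: a unit vector representing what all relevant documents share.
shared = rng.normal(size=dim)
shared /= np.linalg.norm(shared)

# ASSUMPTION: each hypothetical document's embedding is the shared signal
# plus independent noise (the idiosyncratic, possibly hallucinated details).
samples = shared + 0.5 * rng.normal(size=(n_samples, dim))

# Pool the sampled embeddings into one search vector.
pooled = samples.mean(axis=0)

mean_per_sample = float(np.mean([cosine(s, shared) for s in samples]))
pooled_similarity = cosine(pooled, shared)

print(f"average single-sample similarity: {mean_per_sample:.3f}")
print(f"pooled-vector similarity:         {pooled_similarity:.3f}")
```

Under this noise model the pooled vector sits closer to the shared direction than a typical single sample, since independent details partially cancel while the common signal does not. Whether real hallucinated details behave like independent noise is an assumption of the sketch, not a claim from the note.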
The implication is that the query is the wrong level of abstraction for retrieval. Direct query embedding works well when the query is complete enough to uniquely identify relevant content, which is why it succeeds on short-form factoid QA but fails on complex or underspecified requests. Hypothetical documents circumvent this by translating the query into the same genre as the targets.
The approach requires no relevance labels and no retrieval-specific fine-tuning, only an instruction-following LLM and an unsupervised contrastive encoder. On 11 query sets spanning web search, question answering, and fact verification, HyDE with InstructGPT and Contriever significantly outperforms Contriever used alone, the zero-shot baseline that likewise uses no relevance labels.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- Do Doc2Query approaches suffer from the same misaligned-target problem?
- How does embedding dimension affect which documents can rank together?
- Why does text encoding create different subspaces across domains?
- Why does document-document similarity work better than query-document matching?
- Why do embedding-based retrieval systems fail on vocabulary mismatch?
Related concepts in this collection
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  Connection: the grounding gap in dialogue; HyDE is an example of building common ground in retrieval by generating an intermediate representation.
- Can prompt optimization teach models knowledge they lack?
  Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
  Connection: HyDE works because the LLM already has enough knowledge to write a plausible answer; the generation activates a latent representation useful for retrieval.
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Precise Zero-Shot Dense Retrieval without Relevance Labels
- Searching for Best Practices in Retrieval-Augmented Generation
- On the Theoretical Limitations of Embedding-Based Retrieval
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search
- Training for Compositional Sensitivity Reduces Dense Retrieval Generalization
- Generator-Retriever-Generator: A Novel Approach to Open-domain Question Answering
Original note title
query-document vocabulary mismatch makes direct embedding retrieval suboptimal — hypothetical document bridging resolves it