What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Synthesis note · 2026-02-21 · sourced from Deep Research

"Deep research" is used loosely to describe anything from a single web search to a multi-hour autonomous investigation. The Characterizing Deep Research paper proposes a formal three-component definition that makes the boundary precise:

1. Multi-step information gathering: not one retrieval round but a sequence of them, where each round can expand or contract the search space
2. Cross-source synthesis: combining findings from multiple independent sources, not just summarizing one document
3. Iterative query refinement: using partial findings to improve subsequent queries, not issuing all queries upfront
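The contrast between one retrieval round and a loop over all three components can be sketched in a few lines. This is a minimal illustration, not the paper's method: the toy index, `toy_search`, `toy_refine`, and `synthesize` are hypothetical stand-ins invented here, and a real system would call an actual search backend and a model.

```python
from typing import Callable, Optional

Finding = tuple[str, str]  # (source name, snippet)

# Hypothetical toy corpus standing in for a search backend (assumption).
TOY_INDEX: dict[str, list[Finding]] = {
    "deep research": [("paper_a", "deep research uses iterative refinement")],
    "iterative refinement": [("blog_b", "refinement rewrites queries from partial findings")],
}

def toy_search(query: str) -> list[Finding]:
    """Return the findings stored for an exact query, or nothing."""
    return TOY_INDEX.get(query, [])

def toy_refine(question: str, findings: list[Finding]) -> Optional[str]:
    """Hypothetical refinement rule: follow up on a phrase seen in partial findings."""
    for _, snippet in findings:
        if "iterative refinement" in snippet:
            return "iterative refinement"
    return None

def single_step_rag(question: str, search: Callable[[str], list[Finding]]) -> list[Finding]:
    """One retrieval round: the query never changes."""
    return search(question)

def deep_research_loop(
    question: str,
    search: Callable[[str], list[Finding]],
    refine: Callable[[str, list[Finding]], Optional[str]],
    max_rounds: int = 3,
) -> list[Finding]:
    """Several retrieval rounds, where each next query depends on partial findings."""
    findings: list[Finding] = []
    seen: set[str] = set()
    query: Optional[str] = question
    for _ in range(max_rounds):
        if query is None or query in seen:
            break  # nothing new to ask
        seen.add(query)
        findings.extend(search(query))      # multi-step gathering
        query = refine(question, findings)  # iterative query refinement
    return findings

def synthesize(findings: list[Finding]) -> dict[str, list[str]]:
    """Cross-source synthesis, reduced here to merging snippets across sources."""
    return {
        "sources": sorted({source for source, _ in findings}),
        "snippets": [snippet for _, snippet in findings],
    }

report = synthesize(deep_research_loop("deep research", toy_search, toy_refine))
```

In this toy setup the single-step call only ever sees the first source, while the loop reaches the second source because its follow-up query is derived from what the first round returned.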

The definition excludes single-step RAG (fails component 1, multi-step gathering), document summarization (fails component 3, iterative query refinement), and simple web browsing (may fail component 2, cross-source synthesis). It includes only systems that combine all three components within a single loop.
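Read as a checklist, the definition amounts to three boolean tests that must all pass. The sketch below is one hedged way to encode that; `SystemProfile`, `missing_components`, and the example profile values are illustrative assumptions, not taken from the paper or from PRELUDE.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SystemProfile:
    """Which of the three components a system exhibits (hypothetical encoding)."""
    multi_step_gathering: bool
    cross_source_synthesis: bool
    iterative_query_refinement: bool

def missing_components(profile: SystemProfile) -> list[str]:
    """Names of the components the system lacks, in definition order."""
    return [name for name, present in asdict(profile).items() if not present]

def is_deep_research(profile: SystemProfile) -> bool:
    """A system qualifies only when no component is missing."""
    return not missing_components(profile)

# Illustrative profiles (assumed values for the sake of the example).
single_step_rag = SystemProfile(
    multi_step_gathering=False,
    cross_source_synthesis=True,
    iterative_query_refinement=False,
)
full_loop_agent = SystemProfile(True, True, True)
```

Returning the list of missing components, rather than a bare yes or no, mirrors the note's point that the definition is useful for locating where a system falls short.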

The practical value of the definition is benchmarking clarity. Without it, systems that perform single-step retrieval with sophisticated synthesis can claim "deep research" capability even though they lack the iterative refinement component that actually distinguishes deep research (DR) from an enhanced RAG pipeline ("RAG++"). PRELUDE (the benchmark that accompanies the paper) evaluates all three components, making it possible to locate exactly where a system falls short.

This also clarifies what the test-time scaling (TTS) law applies to. The law discussed in the related note "Does search budget scale like reasoning tokens for answer quality?" is a scaling law specifically for systems that meet the full three-component definition. Partial systems that skip iterative query refinement likely show different scaling behavior.

Researchy Questions (2024) operationalizes the "unknown unknowns" concept for deep research. Unlike standard QA benchmarks that study "known unknowns" with clear indications of what information is missing, Researchy Questions identifies non-factoid, multi-perspective, decompositional questions from real search engine logs — questions where the questioner doesn't know what they don't know. Users spend significantly more effort (clicks, session length) on these queries, and "slow thinking" techniques like decomposition into sub-questions show benefit over direct answering. An 8-dimension quality rubric (ambiguity, incompleteness, assumptions, multi-facetedness, knowledge-intensity, subjectivity, reasoning-intensity, harmfulness) provides granular characterization. This distinguishes "deep" questions from merely "hard" ones: a deep question has multiple perspectives allowing a dense manifold of answers, no single correct answer, and requires genuine synthesis rather than just retrieval. Source: Arxiv/Agentic Research.
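The eight rubric dimensions listed above can be held in a small structure so that every question is scored on all of them. This is a sketch only: the dimension names come from the note, but the numeric scores and the `validate_scores` helper are assumptions, and the original rubric's scale is not given here.

```python
# Dimension names as listed in the note; everything else is hypothetical.
RUBRIC_DIMENSIONS = (
    "ambiguity",
    "incompleteness",
    "assumptions",
    "multi-facetedness",
    "knowledge-intensity",
    "subjectivity",
    "reasoning-intensity",
    "harmfulness",
)

def validate_scores(scores: dict[str, float]) -> dict[str, float]:
    """Check that exactly the eight dimensions are scored; return them in rubric order."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    extra = [d for d in scores if d not in RUBRIC_DIMENSIONS]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return {d: scores[d] for d in RUBRIC_DIMENSIONS}

# Placeholder scores on an assumed numeric scale, for illustration only.
example = validate_scores({d: 0.0 for d in RUBRIC_DIMENSIONS})
```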

Related concepts in this collection 2

Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
grounds: the TTS law applies specifically to systems meeting this formal definition; the three components define what search budget measures
Do hierarchical retrieval architectures outperform flat ones on complex queries? Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
connects: hierarchical architecture is the structural implementation of the three-component definition
