TOPIC

Deep Research Agents

15 synthesis notes · 53 source papers

View as

Can schema-free graphs objectively evaluate open-ended search?

Can a directed graph with no preset structure capture the complexity of real search outputs while still enabling objective, fine-grained evaluation? This matters because existing evaluation methods trade objectivity for rigidity or richness for subjectivity.

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Can agents compress their own memory without losing critical details?

Explores whether agents can autonomously consolidate interaction history into structured memory schemas that reduce token overhead while preserving information needed for long-horizon reasoning and strategic reflection.

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Can codified expertise let non-experts match specialist output?

When domain knowledge is captured as explicit rules and principles in an AI agent's scaffolding, can non-experts produce work at expert quality levels without consuming scarce specialist time? This explores whether structured knowledge codification dissolves organizational bottlenecks.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Can agents discover tools dynamically instead of pre-selecting them?

Explore whether agents can find needed tools during execution rather than choosing from a fixed set upfront. This matters for long-horizon tasks where relevant tools cannot be known in advance.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

Can models learn better by training on messy exploration paths?

Does including trial-and-error, reflection, and backtracking in training data teach models to reason more robustly than teaching only the polished shortest path to answers?

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Why do search agents fail users despite strong benchmark scores?

Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?

Do search steps follow the same scaling rules as reasoning tokens?

Exploring whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Can simulated APIs and token-level credit assignment train better tool-using agents?

Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?

Source papers 53

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications
This survey examines the rapidly evolving field of Deep Research systems—AI-powered applications that automate complex research workflows through the integration of large language models, advanced inf…
AI for Auto-Research: Roadmap & User Guide
AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and s…
AI-Researcher: Autonomous Scientific Innovation
The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented o…
Agent Laboratory: Using LLM Agents as Research Assistants
This framework accepts a human-provided research idea and progresses through three stages—literature review, experimentation, and report writing to produce comprehensive research outputs, including a …
AgentRxiv: Towards Collaborative Autonomous Research
To address these challenges, we introduce AgentRxiv—a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and …
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Agentic Reasoning, a framework1 that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on i…
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to ext…
Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey
Building autonomous machines that can explore open-ended environments, discover possible interactions and build repertoires of skills is a general objective of artificial intelligence. Developmental a…
Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback
Novelty assessment is a central yet understudied aspect of peer review, particularly in highvolume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach fo…
Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient…
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search t…
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autono…
Characterizing Deep Research: A Benchmark and Formal Definition
Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of deep research — a term also adopted by recent…
Deep Research: A Systematic Survey
Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, whi…
Deep Researcher with Test-Time Diffusion
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time …
Deep Think with Confidence
Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishi…
DeepAgent: A General Reasoning Agent with Scalable Toolsets
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow p…
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
First, data-wise, most existing QA datasets usually feature relatively simple questions that do not reflect true “hard-to-find” cases. For example, questions in HotpotQA [Yang et al., 2018] can often …
DeepNet: Scaling Transformers to 1,000 Layers
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection i…
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
Large Language Models (LLMs) have shown remarkable potential in reasoning while they still suffer from severe factual hallucinations due to timeliness, accuracy, and coverage of parametric knowledge. …
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports to complex queries. However, most existing frameworks …
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually eng…
Evolving Deeper LLM Thinking
We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine c…
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reason…
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Sparse MoE models activate a subset of model parameters per input token by learning to dynamically route tokens to a subset of parameters (experts); this allows them to decouple total model capacity f…
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing…
How Far Are We from Genuinely Useful Deep Research Agents?
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering b…
How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Knowledge? A Software Engineering Framework
Critical domain knowledge typically resides with few experts, creating organizational bottlenecks in scalability and decision-making. Non-experts struggle to create effective visualizations, leading t…
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often r…
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynam…
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications
We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engin…
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Search engines reshape the way of seeking information but often fail to align with complex human…
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LL…
O1 Replication Journey: A Strategic Progress Report -- Part 1
This paper introduces a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey. In response to the announcement of OpenAI’s groundbreaking O1 model, we embark …
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the origin…
RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with externa…
RL + Transformer = A General-Purpose Problem Solver
What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., metalearn)? In this study, we demonstrate that a pre-…
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abiliti…
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as …
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model fi…
Reasoning Language Models: A Blueprint
such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their hi…
Retrieval-augmented reasoning with lean language models
This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically r…
Search Arena: Analyzing Search-Augmented LLMs
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing dat…
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Ret…
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multistep inference; conversely, p…
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. …
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervisi…
UR2: Unify RAG and Reasoning through Reinforcement Learning
Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learnin…
Universe of Thoughts: Enabling Creative Reasoning with Large Language Models
Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-…
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation–experience gap. We attribute this gap to existing benchmarks'…
Virtuous Machines: Towards Artificial General Science
Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow doma…
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, an…
Why Do Multi-agent LLM Systems Fail?
[[Routers]] Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains across popular benchmarks often remain minimal compared to single-agent frameworks. This gap highlig…