TOPIC

Agentic Research and Workflows

10 synthesis notes · 22 source papers

View as

Where does AI assistance become unreliable in research?

This explores whether AI capability follows a sharp boundary in research tasks, and what determines which side of that line a task falls on. Understanding this matters because it reveals where humans must stay in control.

Can AI verify research outputs as fast as it generates them?

Research suggests AI systems produce plausible findings rapidly but struggle to verify them at the same pace. This creates a bottleneck in verification across all research stages. Understanding this gap matters for assessing when AI assistance is reliable versus risky.

Can automated review loops handle AI-generated research at scale?

As AI agents produce papers faster than humans can evaluate them, can a closed-loop automated review system with retrieval-augmented feedback actually improve quality and catch problems traditional peer review misses?

Do autonomous research mechanisms work better together than apart?

AutoResearchClaw's five mechanisms—debate, self-healing, verification, cross-run evolution, and human oversight—may interact in ways that removing them together causes worse damage than removing each alone. Does this super-additivity hold across other agentic systems?

Why do deep research agents fabricate scholarly content?

Explores whether AI research agents deliberately invent plausible-sounding academic constructs to meet user demands for depth and comprehensiveness, and what drives this behavior.

Does more automation actually hide rather than eliminate errors?

As AI systems become more polished, do they mask failures instead of preventing them? This matters because it changes whether we should focus on detecting problems or governing their disclosure.

When do multi-agent systems actually outperform single agents?

As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes MAS counterproductive.

Why do production AI agents stay deliberately simple?

Production AI agents operate far simpler than research suggests—most execute under 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Does targeted human intervention outperform both full autonomy and exhaustive oversight?

This research explores whether selectively routing high-stakes decisions to humans beats the extremes of letting systems run unsupervised or requiring approval at every step. The question tests whether the optimal human-AI collaboration point lies between these endpoints.

Can experiment failures drive progress instead of stopping it?

Explores whether autonomous research systems can treat failed runs as information rather than termination signals. This matters because real science is iterative, and systems that halt on errors cannot learn from failure.

Source papers 22

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Survey on Context-Aware Multi-Agent Systems: Techniques, Challenges and Future Directions
Research interest in agents is rising, particularly in Artificial Intelligence (AI) techniques such as deep multi-modal representation learning, deep graph learning, and deep reinforcement learning, w…
AI Can Learn Scientific Taste
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potent…
AI for Auto-Research: Roadmap & User Guide
AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and s…
Agentic AI and the next intelligence explosion
By its nature, intelligence is high-dimensional and relational, not a single quantity that must be unambiguously less or greater than human scale. In fact, it is unclear what we even mean by “human sc…
Agents of Chaos
LLM-powered AI agents are rapidly becoming more capable and more widely deployed (Masterman et al., 2024; Kasirzadeh & Gabriel, 2025). Unlike conventional chat assistants, these systems are increasing…
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next …
Bilevel Autoresearch: Meta-Autoresearching Itself
If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every exi…
Deep Research: A Systematic Survey
Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, whi…
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
First, data-wise, most existing QA datasets usually feature relatively simple questions that do not reflect true “hard-to-find” cases. For example, questions in HotpotQA [Yang et al., 2018] can often …
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports to complex queries. However, most existing frameworks …
How Far Are We from Genuinely Useful Deep Research Agents?
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering b…
Measuring Agents in Production
AI agents are already operating in production across many industries, yet there is limited public understanding of the technical strategies that make deployments successful. We present the first large…
Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
Background: Large language model systems are commonly evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when a persistent AI agent is embedded into a …
QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how humans interact with information. However, frontier s…
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarka…
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents
Existing question answering (QA) datasets are no longer challenging to most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly s…
Rethinking Memory as Continuously Evolving Connectivity
Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where fe…
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning cap…
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems…
Single-agent or Multi-agent Systems? Why Not Both?
Multi-agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported the superior accuracy performance of MAS a…
Towards a Science of Scaling Agent Systems
Agents, language model (LM)-based systems that are capable of reasoning, planning, and acting are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the p…
aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists
Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI…