How does search scale like reasoning in agent systems?

How test-time scaling applies to search and reasoning in agentic deep research systems.

Topic Hub · 38 linked notes · 8 sections

View as

Deep Research and Search Scaling

10 notes

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Why do deep research agents fabricate scholarly content?

Explores whether AI research agents deliberately invent plausible-sounding academic constructs to meet user demands for depth and comprehensiveness, and what drives this behavior.

Can structured pipelines make LLM novelty assessment reliable?

Explores whether breaking novelty assessment into extraction, retrieval, and comparison stages helps LLMs align with human peer reviewers and produce more rigorous, evidence-based evaluations.

Can specialized agents write better scientific papers than single models?

Multi-agent frameworks decompose writing into specialized subtasks. This explores whether distributed agents maintaining cross-document consistency outperform single-model approaches on manuscript quality and literature synthesis.

Proactive Search Evaluation (2026-05-28 — VibeSearchBench)

5 notes

Why do search agents fail users despite strong benchmark scores?

Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?

Can schema-free graphs objectively evaluate open-ended search?

Can a directed graph with no preset structure capture the complexity of real search outputs while still enabling objective, fine-grained evaluation? This matters because existing evaluation methods trade objectivity for rigidity or richness for subjectivity.

Where does AI assistance become unreliable in research?

This explores whether AI capability follows a sharp boundary in research tasks, and what determines which side of that line a task falls on. Understanding this matters because it reveals where humans must stay in control.

Can AI verify research outputs as fast as it generates them?

Research suggests AI systems produce plausible findings rapidly but struggle to verify them at the same pace. This creates a bottleneck in verification across all research stages. Understanding this gap matters for assessing when AI assistance is reliable versus risky.

Does more automation actually hide rather than eliminate errors?

As AI systems become more polished, do they mask failures instead of preventing them? This matters because it changes whether we should focus on detecting problems or governing their disclosure.

Production Deployment

1 note

Why do production AI agents stay deliberately simple?

Production AI agents operate far simpler than research suggests—most execute under 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Agentic Systems

5 notes

Can careful selection of 78 demos outperform massive training datasets?

Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Can agents learn new skills without forgetting old ones?

Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Knowledge Graph Reasoning and Training Data Synthesis

5 notes

Why do reasoning systems keep discovering new connections?

Explores whether agentic graph reasoning systems maintain a special balance between semantic diversity and structural organization that enables continuous discovery of novel conceptual relationships.

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Can externalizing LLM reasoning into structured knowledge graph triples enable smaller, cheaper models to match the performance of much larger ones? This explores whether making reasoning explicit and inspectable improves both capability and transparency.

Can knowledge graphs generate training data for search agents?

Exploring whether synthesizing questions from knowledge graph random walks with entity blurring can create the hard-to-find training data needed to teach deep search agents to reason and search effectively.

Can symbolic rules from knowledge graphs guide complex reasoning?

Can deriving symbolic rules directly from knowledge graph structure help align natural language questions with structured reasoning paths? This explores whether explicit structural patterns outperform semantic similarity for multi-hop inference.

Can iterative revision cycles match how humans actually write?

Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?

Autonomous Science and Ideation — Batch #3 backlog (2026-06-03)

3 notes

Can AI research itself without losing human oversight?

Explores whether AI systems can internalize the human judgment and insight-distillation that normally drives research progress, and what this means for maintaining meaningful human control over AI advancement.

Can decentralized teams outperform central planners in long-running science?

Explores whether autonomous agent teams that self-organize around competing hypotheses and share failures can achieve better experimental outcomes than centrally-planned approaches, especially under fixed research budgets.

Why do LLMs generate ideas the research community already explores?

LLMs inherit the distribution of published literature, concentrating ideation where researchers have already invested conceptual effort. This raises a core question: can AI ideation complement rather than duplicate human research directions?

AI-for-AI and Research Venues — Batch #3 wave 2 (2026-06-03)

2 notes

Can machine feedback sustain discovery at test time?

Can LLMs paired with automated evaluators discover genuinely novel solutions through iterative refinement, rather than just generating hypotheses? This matters because it tests whether autonomous research scales beyond benchmarks to real deployed innovations.

Can automated review loops handle AI-generated research at scale?

As AI agents produce papers faster than humans can evaluate them, can a closed-loop automated review system with retrieval-augmented feedback actually improve quality and catch problems traditional peer review misses?