Can LLMs replace search engines during agent training?
Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.
Two papers, ZeroSearch and SSRL, converge on the same principle from different angles: LLMs hold enough internal world knowledge to stand in for a search engine during RL training, removing the prohibitive API costs of querying a real one.
ZeroSearch addresses this architecturally. Lightweight SFT transforms a small LLM (3B-14B) into a retrieval module that generates both relevant and noisy documents in response to a query. The key advantage over real search: controllable document quality. By adjusting prompts, the simulator generates either helpful or misleading documents, enabling a curriculum rollout strategy that progressively degrades quality during training. The policy model first learns basic formats, then adapts to increasingly challenging retrieval scenarios.
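The curriculum idea can be sketched in a few lines. This is a minimal illustration, not ZeroSearch's actual implementation: the exponential interpolation, the default values (`p_start`, `p_end`, `base`), the prompt wording, and all function names are assumptions made for the example.

```python
import random


def noise_probability(step, total_steps, p_start=0.0, p_end=0.5, base=4.0):
    """Probability of asking the simulator for a noisy document at `step`.

    Assumed schedule: exponential interpolation from p_start to p_end,
    so degradation is slow early in training and faster later on.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if base <= 0 or base == 1.0:
        raise ValueError("base must be positive and different from 1")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    growth = (base ** frac - 1.0) / (base - 1.0)   # 0 at start, 1 at end
    return p_start + growth * (p_end - p_start)


def build_simulator_prompt(query, noisy):
    """Hypothetical prompt template: one keyword controls document quality."""
    quality = "noisy and misleading" if noisy else "useful and relevant"
    return f"Generate five {quality} documents for the search query: {query}"


def sample_prompt(query, step, total_steps, rng):
    """Draw a simulator prompt whose quality follows the curriculum."""
    noisy = rng.random() < noise_probability(step, total_steps)
    return build_simulator_prompt(query, noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        p = noise_probability(step, 100)
        print(step, round(p, 3), sample_prompt("capital of Australia", step, 100, rng))
```

Because quality is just a prompt switch, the schedule is the only thing that changes across training; the policy sees clean documents while it learns the interaction format and an increasing share of misleading ones afterwards.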
The result is striking: a 7B retrieval module achieves comparable performance to a real search engine. A 14B module surpasses it. The LLM-simulated environment provides more stable and controllable training than noisy real-world search.
SSRL (Self-Search RL) approaches the same principle from the inference side. LLMs auto-regressively generate search queries, then generate the relevant information to address them, producing the entire reasoning trajectory in a single generation with no external retrieval calls. The usable internal knowledge scales with inference budget: pass@k improves substantially as more samples are drawn, reaching high accuracy on BrowseComp. RL further enhances this Self-Search capability through format-based and rule-based rewards.
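For readers unfamiliar with pass@k, the sketch below shows the commonly used unbiased estimator: given n sampled trajectories of which c are correct, it estimates the chance that at least one of k samples is correct. Whether SSRL computes pass@k with exactly this estimator is an assumption here; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn for a question
    c: number of those samples that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 5 correct answers out of 100 samples.
    for k in (1, 10, 50):
        print(k, pass_at_k(100, 5, k))
```

The estimate rises with k even when the per-sample success rate is low, which is the sense in which internal knowledge "scales with inference budget".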
The tension with "Why do search agents beat memorized retrieval on hard questions?" is real but conditional. Real-world search outperforms simulated search on tasks requiring temporal currency or rare knowledge. But for the majority of training iterations where the goal is learning search behavior (when to search, how to formulate queries, how to evaluate results), simulated search provides adequate signal at dramatically lower cost.
SSRL adds a surprising finding: thinking tokens are inefficient for search tasks. Long CoT does not improve Self-Search performance — contradicting the pattern seen in math reasoning. Search primarily requires knowledge retrieval, not extended deliberation. Short-CoT should be preferred to maximize token efficiency.
Inquiring lines that use this note as a source 19
This note is a source for these synthesized inquiries.
- Can agent-based simulators replace real-user A/B testing for studying recommendation system harms?
- Can parallel agents or complementary mechanisms replace single-human interrogation of LLMs?
- How does content-only knowledge in LLMs enable pretraining popularity to leak through?
- What deployment feedback loops amplify LLM pretraining popularity in live systems?
- How do controllable simulators compare to population-level agent simulation approaches?
- What safety protections work when simulators have access to real APIs?
- How does real tool integration change what agents learn compared to simulated tools?
- Can knowledge graphs generate scalable training data for deep search agents?
- Can the serving loop itself become the primary training data source?
- How do search API lookups enable LLM recommenders over proprietary or dynamic corpora?
- When does simulated search outperform real search for agent training?
- What happens when you train user simulators instead of task agents?
- What makes software engineering environments better suited for RL than other interactive domains?
- How do recommender metrics drive LLM query refinement in closed-loop training?
- How does LLM simulation of APIs avoid instability without sacrificing training signal?
- What makes natural-language APIs particularly suited to LLM-based simulation?
- How do knowledge graphs scale as training data for open-ended search tasks?
- How much does external API latency dominate total agent execution cost?
- Can high benchmark scores mislead deployment decisions for search agents?
Related concepts in this collection 3
- Why do search agents beat memorized retrieval on hard questions?
  Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?
  the tension: real search for deployment, simulated search for training
- Can prompt optimization teach models knowledge they lack?
  Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
  Self-Search is the extreme version: the model activates its own knowledge as search results
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  search/knowledge-retrieval is another task type where extended reasoning is inefficient
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
- DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
- SSRL: Self-Search Reinforcement Learning
- RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Original note title
llms can simulate search engines via internal knowledge eliminating api costs for rl training of search agents