Does reinforcement learning squeeze exploration diversity in search agents?
Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
The "RL Squeezes, SFT Expands" paper studies search agents trained with RL versus SFT and finds the same pattern that the reasoning literature documented: RL training compresses the diversity of behaviors the agent explores (squeezes), while SFT on diverse demonstrations expands it. The related note "Does policy entropy collapse limit reasoning performance in RL?" documents this dynamic for reasoning training, and this paper shows it in search RL as well. Entropy collapse is therefore not a quirk of reasoning training; the evidence points to it being a property of RL training at large.
The mechanism is the same in both domains: RL raises the probability of high-reward outputs and lowers the probability of low-reward ones. Over training, the policy concentrates probability mass on the reward-maximizing region of its action space. In reasoning, this means converging on a narrow set of reasoning patterns. In search, it means converging on a narrow set of query strategies. Both reduce the agent's ability to explore novel approaches to hard problems.
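The concentration of probability mass described above can be illustrated with a toy model. The sketch below is not the paper's setup: it assumes a tabular softmax policy over four hypothetical query strategies with made-up mean rewards, and it applies the exact policy gradient instead of sampled rollouts. Under those assumptions, policy entropy falls from its uniform starting value as training proceeds.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_gradient_step(logits, rewards, lr):
    """One exact gradient-ascent step on expected reward J = sum_i p_i * r_i.

    For a softmax policy, dJ/dlogit_j = p_j * (r_j - J).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [t + lr * p * (r - expected) for t, p, r in zip(logits, probs, rewards)]

def train(rewards, steps=2000, lr=0.5):
    logits = [0.0] * len(rewards)  # start from a uniform policy
    history = [entropy(softmax(logits))]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        history.append(entropy(softmax(logits)))
    return softmax(logits), history

# Hypothetical mean rewards for four query strategies (illustrative values only).
rewards = [0.9, 0.6, 0.5, 0.2]
final_probs, entropy_history = train(rewards)
print("initial entropy:", round(entropy_history[0], 3))
print("final entropy:  ", round(entropy_history[-1], 3))
print("final policy:   ", [round(p, 3) for p in final_probs])
```

In this toy, the strategies with lower mean reward lose almost all of their probability even though they still earn nonzero reward, which is the sense in which the policy "squeezes".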
SFT has the opposite effect because it trains on human demonstrations or diverse synthetic completions — the diversity of the training set is preserved in the policy. The tradeoff is that SFT cannot generalize beyond its demonstrations in the same way RL can.
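A minimal sketch of why SFT can preserve demonstration diversity, under strong simplifying assumptions: the policy is a tabular softmax over four hypothetical query strategies, and SFT is modeled as plain maximum-likelihood training on a small made-up demonstration set. In that setting the fitted policy reproduces the demonstration frequencies, so its entropy matches the entropy of the demonstrations. A real neural policy would only approximate this.

```python
import math
from collections import Counter

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fit_sft(demos, strategies, steps=500, lr=1.0):
    """Minimize average negative log-likelihood of the demonstrations.

    For a softmax policy, the gradient of the average NLL with respect to
    logit j is p_j - q_j, where q is the empirical demonstration frequency.
    """
    counts = Counter(demos)
    target = [counts[s] / len(demos) for s in strategies]
    logits = [0.0] * len(strategies)
    for _ in range(steps):
        probs = softmax(logits)
        logits = [t - lr * (p - q) for t, p, q in zip(logits, probs, target)]
    return softmax(logits), target

# Hypothetical strategy labels and demonstration counts (illustrative only).
strategies = ["keyword", "quoted_phrase", "site_filter", "reformulate"]
demos = (["keyword"] * 4 + ["quoted_phrase"] * 3
         + ["site_filter"] * 2 + ["reformulate"] * 1)

sft_probs, demo_freqs = fit_sft(demos, strategies)
print("demo frequencies:", demo_freqs)
print("SFT policy:      ", [round(p, 4) for p in sft_probs])
print("demo entropy:", round(entropy(demo_freqs), 4),
      "policy entropy:", round(entropy(sft_probs), 4))
```

The flip side is visible in the same toy: a strategy that never appears in the demonstrations gets a target frequency of zero, so the fitted policy has no reason to try it.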
This finding has practical implications for DR agent design: RL-trained search agents need explicit diversity mechanisms (entropy regularization, diverse reward models, periodic SFT refreshes) or they will converge on query templates that work well on average but fail under distribution shift. The remedy described in the related note "Do critique models improve diversity during training itself?" applies here too: external critique can keep the RL agent from collapsing to a narrow search strategy.
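Of the mechanisms listed above, entropy regularization is the easiest to sketch. The example below is a toy illustration, not a recipe from the paper: it assumes a tabular softmax policy over four hypothetical query strategies, made-up mean rewards, an exact gradient, and an illustrative coefficient `beta`. The objective is expected reward plus `beta` times policy entropy.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_step(logits, rewards, lr, beta):
    """One exact gradient-ascent step on J + beta * H for a softmax policy.

    dJ/dlogit_j = p_j * (r_j - J)
    dH/dlogit_j = -p_j * (log p_j + H)
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    h = entropy(probs)
    new_logits = []
    for t, p, r in zip(logits, probs, rewards):
        reward_grad = p * (r - expected)
        # max(...) guards against log(0) if a probability underflows.
        entropy_grad = -p * (math.log(max(p, 1e-300)) + h)
        new_logits.append(t + lr * (reward_grad + beta * entropy_grad))
    return new_logits

def train(rewards, beta, steps=2000, lr=0.5):
    logits = [0.0] * len(rewards)  # start from a uniform policy
    for _ in range(steps):
        logits = regularized_step(logits, rewards, lr, beta)
    return softmax(logits)

# Hypothetical mean rewards for four query strategies (illustrative values only).
rewards = [0.9, 0.6, 0.5, 0.2]
plain_probs = train(rewards, beta=0.0)        # no entropy bonus
regularized_probs = train(rewards, beta=0.3)  # with entropy bonus

print("plain:      ", [round(p, 3) for p in plain_probs],
      "entropy", round(entropy(plain_probs), 3))
print("regularized:", [round(p, 3) for p in regularized_probs],
      "entropy", round(entropy(regularized_probs), 3))
```

In this tabular setting the regularized optimum is the Boltzmann distribution with probabilities proportional to `exp(r / beta)`, so `beta` directly trades reward for diversity: a larger value keeps more strategies alive at the cost of lower expected reward. How that tradeoff plays out for a real search agent is an empirical question this sketch does not answer.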
Inquiring lines that use this note as a source 120
This note is a source for these synthesized inquiries.
- Which AI interaction patterns preserve learning while which ones degrade skill formation?
- Can few-shot examples narrow generative diversity in creative tasks?
- Do dynamic environments enable different kinds of agent-environment coevolution?
- Do disorder-specific RL policies outperform single policies across anxiety, depression, and schizophrenia?
- How does entropy collapse in reinforcement learning differ from entropy maintenance in graph reasoning?
- Does optimizing directly for semantic diversity improve both reasoning quality and exploration?
- What role does environment diversity play in preventing agents from overfitting to curator imagination?
- Why does RLVR increase token entropy while decreasing answer diversity?
- What role does exploration-exploitation balance play in abstraction formation?
- When does natural context diversity reduce the need for explicit exploration?
- How do evolutionary archives enable diverse exploration in self-improving systems?
- Can population diversity in self-improvement prevent error avalanching failures?
- Why do evolutionary algorithms collapse to single solutions under selection pressure?
- How does latent space diffusion enable evolutionary search in high dimensions?
- Can accelerated sampling techniques from image generation speed up evolutionary search?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- Why does island model genetic evolution maintain diversity better than single populations?
- Does policy entropy collapse represent the main bottleneck in reasoning-focused RL scaling?
- How does forced exploration through diversity rewards differ from suppression-based negative reinforcement?
- Why do research ideation systems suffer from diversity collapse despite high novelty metrics?
- How does covariate diversity compare to the exploration assumptions of LinUCB?
- Can diverse human creativity survive if all AI systems converge on similar outputs?
- Why does policy entropy collapse limit reasoning and dialogue RL scaling?
- What training difficulty and curriculum settings prevent instability in empathetic agent RL?
- How does co-player diversity force agents to develop general adaptation?
- How does mutual shaping through diverse training compare to population-level diversity effects?
- Can meta-reinforcement learning explain why this bias pattern emerges rationally?
- How does reinforcement learning compare to differentiable joint training for RAG?
- Does social scaffolding outperform purely intrinsic motivation for agent exploration?
- Can combinational creativity alone drive open-ended learning in agents?
- Do task-specific heuristics emerge because they compress well enough?
- Can structural diversity through role assignment replace emergent diversity in small models?
- When does simulated search outperform real search for agent training?
- How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?
- Does policy entropy collapse limit how many iterations of reasoning training work?
- How does role specialization preserve reasoning diversity in multi-agent teams?
- Can cognitive diversity overcome expertise gaps in agent teams?
- Can cognitive diversity compensate for lack of expertise in agent teams?
- How does entropy collapse affect creative capability in multi-task settings?
- Why does positive reinforcement degrade diversity at higher k values?
- Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?
- Why does exploration quality matter more than learner network depth?
- How does majority voting fail when reasoning samples lack genuine diversity?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- How does RLHF-induced mode collapse limit diversity in LLM-generated personas?
- Can evolutionary search solve persona diversity better than prompt engineering?
- Can diversity-aware RL objectives prevent format convergence?
- How does diversity loss in synthetic data mirror tail distribution disappearance?
- How does policy entropy during training affect search discipline during inference?
- How does the pretrained prior set a capability ceiling for reward model exploration?
- Does RL refine existing knowledge or discover entirely new capabilities?
- How does RL compress reasoning path diversity during training?
- Why does policy entropy collapse predict sigmoid saturation points?
- Which recipe choices determine the asymptotic ceiling in RL training?
- What happens to model reasoning when policy entropy collapses during RL?
- How do RL subnetworks identified from different random seeds compare?
- Why do high entropy tokens carry most of the learning signal in RL?
- Does sparsity in RL arise from training on policy-distribution data?
- Is distribution selection during RL the same compression mechanism as entropy collapse?
- How does next-turn reward optimization contribute to agent passivity?
- How does behavior cloning reduce complexity before RL training in rerankers?
- Does context diversity ever make active exploration unnecessary in bandits?
- What makes behavioral cloning produce more persuadable but less aligned agents?
- How does diversity collapse during iterative self-improvement cycles?
- How does representational convergence differ from policy entropy collapse in iterative training?
- Can curriculum approaches teach agents when to stop exploring?
- Why does policy entropy collapse primarily at token level rather than hidden states?
- How can semantic diversity optimization work if exploration and exploitation were truly opposed?
- How does diversity collapse during iterative self-improvement affect solution quality?
- What distinguishes intrinsic search from extrinsic search method approaches?
- Does critique training improve exploration diversity during model training or only test time?
- Can historical and batch exploration be implemented with the same algorithmic mechanism?
- How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
- Can small numbers of curated demonstrations produce emergent agentic behavior?
- Why does prolonged RL discover strategies absent from any base model sample?
- What training objectives could reduce completion bias in autonomous agents?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- How does directional diversity compare to other forms of parallel planning?
- How do human-agent systems incorporate diverse feedback into model behavior?
- How do high-entropy tokens concentrate reinforcement learning's effect?
- What training method supports dynamic tool discovery in long-horizon agents?
- How does memory folding enable agents to reconsider strategies mid-task?
- Can explicitly optimizing for semantic diversity during RL training improve both quality and variation?
- Why does preference tuning reduce diversity in code but increase it in creative tasks?
- What happens to model grounding when preference optimization increases effective diversity?
- Can training on diverse related tasks be more efficient than task-specific training?
- Does the pretrained prior actually constrain what internalized search can discover?
- Why do current metacognitive training loops fail when agents encounter new domains?
- Can pretrained priors set exploration ceilings for empathetic capability development?
- How does curriculum learning prevent instability in social-emotional RL training?
- How does entropy loss enable exploration beyond a single training example?
- How does on-policy entropy recognition differ from training-time entropy collapse?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- How should multi-objective post-training balance competing behavioral goals?
- How should we evaluate diversity differently across programming and creative tasks?
- Can the exploration ceiling be raised beyond what pretraining established?
- Why does policy entropy collapse when scaling RL for reasoning?
- What makes supervised fine-tuning worsen RL exploration later?
- Can entropy regularization or critique models prevent search strategy collapse during RL training?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- Should test-time search maximize diversity of competent solutions instead of converging on one strategy?
- How does probability mass concentration affect sampling diversity across model scales?
- Why does diversity collapse occur in multi-agent research ideation despite high novelty?
- Can backward planning reduce search difficulty when multiple goal state paths exist?
- Why does the pretrained prior determine the exploration ceiling?
- What causes policy entropy collapse in reasoning-focused reinforcement learning?
- Why does outcome-based RL specifically lose diversity during training?
- Does semantic diversity in output space compete with reward-component diversity?
- Does policy entropy collapse prevent inference-time search from finding solutions?
- How much does diversity training cost in single-shot pass@1 performance?
- Can evolutionary search unlock problems that best-of-n selection cannot solve?
- Can the same problem be solved by multiple evolutionary search strategies?
- Does verbalized sampling preserve factual accuracy and safety during diversity gains?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- Why does strategy diversity within reasoning chains improve model generalization?
- How does active selection of training content differ from random reinforcement sampling?
- Do gains from harness-based agents transfer across different search benchmarks?
- Can agents escape weak belief tracking and conservative action selection traps?
- What makes exploration a verifiable and measurable training objective?
Related concepts in this collection 6
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  extends: entropy collapse is confirmed in the search domain; the bottleneck is architectural, not reasoning-specific
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  applies: the diversity-preservation remedy generalizes to search RL; critique models prevent search strategy collapse
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  parallel RL emergence pattern: domain reasoning capabilities (AlphaMed) and search capabilities both emerge from RL reward signals; entropy collapse constrains scaling in both
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
  algorithm-invariance evidence in reasoning and entropy collapse in search are the same mechanism from different angles: both show RL is bounded by the pretrained prior, not by optimizer choice
- Does RL training collapse format diversity in pretrained models?
  Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
  the format-level selection mechanism: RL entropy collapse in search narrows strategy diversity within one distribution, while the echo chamber effect selects which pretraining distribution survives; format selection precedes and compounds within-format diversity loss
- Should training maximize diversity when models feed into search?
  If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
  extends: this note prescribes the diversity-as-objective training fix for the entropy-collapse-in-search failure that note documents
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- Jointly Reinforcing Diversity and Quality in Language Model Generations
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- Outcome-based Exploration for LLM Reasoning
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
- DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Original note title
rl training for search agents squeezes exploration diversity while sft expands it — the same entropy collapse dynamic operates in search as in reasoning