Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Synthesis note · 2026-02-21 · sourced from Deep Research

The "RL Squeezes, SFT Expands" paper compares search agents trained with RL against agents trained with SFT and reports the pattern already documented in the reasoning literature: RL training compresses the range of behaviors the agent explores (it squeezes), while SFT on diverse demonstrations widens that range (it expands). Policy entropy collapse is already known to limit reasoning performance in RL (see Does policy entropy collapse limit reasoning performance in RL?). Because this paper shows the same dynamic in search RL, entropy collapse looks less like a quirk of reasoning training and more like a general property of RL training.

The mechanism is the same in both domains. RL raises the probability of outputs that earn high reward and lowers the probability of outputs that earn low reward. Over training, the policy concentrates probability mass on the reward-maximizing region of its action space, so its entropy falls. In reasoning, this means converging on a narrow set of reasoning patterns. In search, it means converging on a narrow set of query strategies. Both reduce the agent's ability to explore novel approaches to hard problems.
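The concentration of probability mass can be seen in a toy setting. The sketch below is an illustration only, not the paper's experimental setup: it assumes a five-armed bandit where each arm stands in for a query strategy, a softmax policy over hypothetical fixed rewards, and exact policy-gradient updates. The names `train`, `rewards`, and the step count and learning rate are all assumptions chosen for the example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_gradient_step(logits, rewards, lr):
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Exact gradient of expected reward w.r.t. logit i: p_i * (r_i - E[r]).
    return [x + lr * p * (r - expected_reward)
            for x, p, r in zip(logits, probs, rewards)]

def train(rewards, steps=2000, lr=1.0):
    logits = [0.0] * len(rewards)           # start from a uniform policy
    history = [entropy(softmax(logits))]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        history.append(entropy(softmax(logits)))
    return softmax(logits), history

# Hypothetical average rewards for five query strategies.
rewards = [1.0, 0.8, 0.5, 0.2, 0.1]
final_probs, entropy_history = train(rewards)
print("initial entropy:", entropy_history[0])
print("final entropy:  ", entropy_history[-1])
print("final policy:   ", [round(p, 4) for p in final_probs])
```

The policy starts at maximum entropy and ends with nearly all of its mass on the single best strategy, even though the second strategy earns 80% of the best reward. Nothing in the plain reward objective pays the policy for keeping alternatives alive.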

SFT has the opposite effect because it trains on human demonstrations or diverse synthetic completions: the policy is fit to the training distribution, so the diversity of the training set carries over into the policy. The tradeoff is that SFT does not generalize beyond its demonstrations the way RL can.
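A minimal sketch of why fitting to demonstrations preserves their diversity, under a strong simplifying assumption: the policy is a plain categorical distribution over strategy labels, for which the maximum-likelihood fit is just the empirical frequency table. A real language model only approximates this, and the strategy names and the helper `fit_categorical_mle` are hypothetical.

```python
import math
from collections import Counter

def fit_categorical_mle(demonstrations):
    # For a categorical model, maximum likelihood equals empirical frequencies.
    counts = Counter(demonstrations)
    total = len(demonstrations)
    return {strategy: n / total for strategy, n in counts.items()}

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical demonstration set: one label per demonstrated query strategy.
demos = ["keyword", "keyword", "quoted phrase",
         "site filter", "date range", "synonym expansion"]

sft_policy = fit_categorical_mle(demos)
print(sft_policy)
print("policy entropy:", entropy(sft_policy.values()))
```

Every strategy that appears in the demonstrations keeps nonzero probability in the fitted policy, and the policy's entropy equals the entropy of the demonstration set. The same property is the limit on SFT: a strategy that never appears in the data gets no probability at all.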

This finding has practical implications for deep research (DR) agent design: RL-trained search agents need explicit diversity mechanisms (entropy regularization, diverse reward models, periodic SFT refreshes), or they will converge on query templates that work well on average but fail under distribution shift. The remedy described in Do critique models improve diversity during training itself? applies here as well: external critique keeps the RL agent from collapsing to a narrow search strategy.
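Of the mechanisms listed, entropy regularization is the easiest to sketch. The example below is a toy illustration, not the paper's method: it assumes a softmax policy over five hypothetical query strategies with fixed rewards, and adds an entropy bonus weighted by an assumed coefficient `beta` to the exact policy gradient. With `beta=0.0` it reduces to the plain reward objective.

```python
import math

def log_softmax(logits):
    # Stable log-probabilities; avoids taking log of an underflowed zero.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def entropy_from_logits(logits):
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)

def regularized_step(logits, rewards, beta, lr):
    logp = log_softmax(logits)
    probs = [math.exp(lp) for lp in logp]
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    h = -sum(p * lp for p, lp in zip(probs, logp))
    new_logits = []
    for x, p, lp, r in zip(logits, probs, logp, rewards):
        reward_grad = p * (r - expected_reward)   # d E[r] / d logit
        entropy_grad = -p * (lp + h)              # d H / d logit
        new_logits.append(x + lr * (reward_grad + beta * entropy_grad))
    return new_logits

def final_entropy(rewards, beta, steps=2000, lr=1.0):
    logits = [0.0] * len(rewards)                 # uniform starting policy
    for _ in range(steps):
        logits = regularized_step(logits, rewards, beta, lr)
    return entropy_from_logits(logits)

# Hypothetical average rewards for five query strategies.
rewards = [1.0, 0.8, 0.5, 0.2, 0.1]
plain = final_entropy(rewards, beta=0.0)
regularized = final_entropy(rewards, beta=0.2)
print("entropy without bonus:", plain)
print("entropy with bonus:   ", regularized)
```

The entropy bonus pulls the policy toward a distribution proportional to exp(reward / beta), so near-optimal strategies keep meaningful probability instead of being driven toward zero. The cost is a lower expected reward on the training distribution, which is the tradeoff a designer accepts in exchange for robustness under shift.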

Related concepts in this collection 6

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
extends: entropy collapse is confirmed in the search domain; the bottleneck comes from RL training itself, not from the reasoning domain
Do critique models improve diversity during training itself? Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
applies: the diversity-preservation remedy generalizes to search RL; critique models prevent search strategy collapse
Can simple rewards alone teach complex domain reasoning? Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
parallel RL emergence pattern: domain reasoning capabilities (AlphaMed) and search capabilities both emerge from RL reward signals; entropy collapse constrains scaling in both
Does the choice of RL algorithm actually matter for reasoning? Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
algorithm-invariance evidence in reasoning and entropy collapse in search are the same mechanism from different angles: both show RL is bounded by the pretrained prior, not by optimizer choice
Does RL training collapse format diversity in pretrained models? Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
the format-level selection mechanism: RL entropy collapse in search narrows strategy diversity within one distribution, while the echo chamber effect selects which pretraining distribution survives — format selection precedes and compounds within-format diversity loss
Should training maximize diversity when models feed into search? If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
extends: this note prescribes the diversity-as-objective training fix for the entropy-collapse-in-search failure that note documents

Original note title

rl training for search agents squeezes exploration diversity while sft expands it — the same entropy collapse dynamic operates in search as in reasoning
