Can entropy regularization or critique models prevent search strategy collapse during RL training?

This explores whether two specific interventions — entropy regularization and critique/feedback models — can stop an RL-trained search agent from narrowing onto a single rigid strategy, and the corpus suggests they attack two different parts of the same problem.

The corpus has material on both interventions and treats them as complementary rather than competing fixes. First, it's worth knowing the collapse is real and not unique to search. RL training squeezes exploration diversity in search agents through the *same* mechanism documented in reasoning: policies converge on whatever maximizes reward and abandon the rest (Does reinforcement learning squeeze exploration diversity in search agents?). That convergence has a measurable signature: performance saturates as policy entropy approaches zero, following the empirical law R = -a·exp(H) + b, so the ceiling can be roughly predicted from the entropy curve (Does policy entropy collapse limit reasoning performance in RL?). And the thing being collapsed onto isn't necessarily the *best* strategy: controlled experiments show RL amplifies a single dominant format inherited from pretraining within the first epoch, with the winner depending on model scale rather than necessarily on performance (Does RL training collapse format diversity in pretrained models?).
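The empirical law quoted in the entropy-collapse note, R = -a·exp(H) + b, can be made concrete with a few lines of Python. This is a minimal sketch: the coefficients `a` and `b` below are hypothetical values chosen for illustration, not fitted numbers from any paper, and the function names are invented here.

```python
import math

def predicted_reward(entropy: float, a: float, b: float) -> float:
    """Evaluate the empirical fit R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(entropy) + b

def reward_ceiling(a: float, b: float) -> float:
    """Performance at H = 0: exp(0) = 1, so the ceiling is b - a."""
    return predicted_reward(0.0, a, b)

# Hypothetical coefficients, for illustration only.
a, b = 0.2, 0.8
for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H={h:.1f}  R={predicted_reward(h, a, b):.3f}")
```

Under this form, reward rises as entropy is spent and flattens at b - a once entropy is exhausted, which is why a policy that burns its entropy early has little room left to improve.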

On the entropy-regularization side, the answer is a qualified yes. The named interventions (Clip-Cov, KL-Cov, and GPPO) work by managing *how* entropy is reduced during training rather than letting it crater, preserving exploratory capacity and pushing back the performance ceiling (Does policy entropy collapse limit reasoning performance in RL?). But there's a tell-tale catch the corpus surfaces: even without any regularizer that explicitly encourages sparsity, RL updates only 5–30% of parameters, and those sparse updates are nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). That structural narrowing suggests entropy management is fighting a strong built-in pull toward concentration: regularization slows the collapse but does not reverse the underlying tendency.
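To make "managing how entropy is reduced" less abstract, here is a rough Python sketch of two ingredients such methods plausibly rely on: measuring the entropy of a next-token distribution, and flagging the tokens whose log-probability covaries most strongly with advantage. The covariance-based selection is an assumption suggested by the "Cov" in the method names; the notes here do not spell out the actual algorithms, and the function names and the `frac` parameter are hypothetical.

```python
import math
from typing import List

def policy_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_covariance_tokens(logps: List[float], advantages: List[float], frac: float) -> List[int]:
    """Return indices of the tokens with the largest centered product
    (logp - mean_logp) * (adv - mean_adv).

    Assumption: tokens like these drive entropy down fastest, so a
    covariance-aware method would clip or penalize their updates.
    """
    n = len(logps)
    mean_lp = sum(logps) / n
    mean_adv = sum(advantages) / n
    cov_terms = [(lp - mean_lp) * (adv - mean_adv) for lp, adv in zip(logps, advantages)]
    k = max(1, int(frac * n))  # always flag at least one token
    ranked = sorted(range(n), key=lambda i: cov_terms[i], reverse=True)
    return sorted(ranked[:k])

# Toy values, for illustration only.
logps = [-0.1, -2.0, -0.2, -1.5]
advantages = [1.0, -0.5, 0.9, -0.2]
flagged = high_covariance_tokens(logps, advantages, frac=0.5)
```

A training loop could then detach or down-weight the gradient on the flagged positions; how exactly each named method does that is outside what this sketch claims.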

The critique-model angle is the more interesting lateral move, because it changes *what information* the policy gets rather than just how widely it samples. The core diagnosis: numerical rewards are informationally thin. They tell the model it failed but not why or how to improve. Critique-GRPO shows that models stuck on a performance plateau start producing correct solutions once given chain-of-thought critiques instead of bare scalars (Can natural language feedback overcome numerical reward plateaus?). Tree-search critics do something adjacent: AlphaLLM's three critic models derive dense, process-level quality signals that rank solution *paths*, which is exactly the granularity a search agent needs to recognize that a strategy is dead-ending before it commits (Can tree search replace human feedback in LLM training?). A leaner variant, DRO, reuses cross-rollout variance at two levels: as a token-level weight in the dense reward and as a query-level filter that throws out degenerate comparisons, buying 2–3× faster, more stable training (Can one statistical measure serve dual purposes in RL training?).
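The idea of discarding degenerate comparisons can be illustrated with the simplest possible version: drop any query whose rollouts all received the same reward, since a group-normalized advantage carries no signal there. This sketch assumes one scalar reward per rollout and uses plain reward variance; it is not DRO's actual statistic, and the function names and `min_var` threshold are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List

def filter_degenerate_queries(
    rewards_by_query: Dict[str, List[float]], min_var: float = 1e-8
) -> Dict[str, List[float]]:
    """Keep only queries whose rollout rewards actually differ."""
    return {
        query: rewards
        for query, rewards in rewards_by_query.items()
        if len(rewards) > 1 and pvariance(rewards) > min_var
    }

def group_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages; only call this on non-degenerate groups."""
    mean = sum(rewards) / len(rewards)
    std = pvariance(rewards) ** 0.5
    return [(r - mean) / std for r in rewards]

# Toy rollouts, for illustration only.
rollouts = {
    "q1": [1.0, 1.0, 1.0],       # every rollout succeeded: nothing to compare
    "q2": [0.0, 1.0, 1.0, 0.0],  # mixed outcomes: informative
    "q3": [0.0, 0.0],            # every rollout failed: nothing to compare
}
kept = filter_degenerate_queries(rollouts)
```

Filtering before normalization also avoids dividing by a zero standard deviation, which is one practical reason this kind of check tends to stabilize training.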

The synthesis worth carrying away: entropy regularization and critique models prevent collapse at different layers. Entropy methods keep the policy *sampling broadly* (a width problem); critique models keep it *learning the right thing from each sample* (a signal-quality problem). The two-phase view of RL training hints at why you might want both: early training is driven by execution correctness, but the later bottleneck is strategic exploration, where planning-token entropy actually needs to *rise* (Does RL training follow a predictable two-phase learning sequence?). The unexpected coda is that the cleanest fix might sit upstream of either: SFT on diverse demonstrations preserves exploration breadth that RL then erodes (Does reinforcement learning squeeze exploration diversity in search agents?), implying you prevent collapse partly by what you bank *before* RL begins, not only by what you regularize during it.
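Tracking the two-phase dynamic requires separating planning-token entropy from execution-token entropy. A minimal sketch, assuming you already have per-token entropies and a boolean mask marking planning tokens; how that mask is produced is not specified in the notes, and the function name is hypothetical.

```python
from typing import List, Tuple

def split_mean_entropy(
    token_entropies: List[float], is_planning: List[bool]
) -> Tuple[float, float]:
    """Return (mean planning-token entropy, mean execution-token entropy)."""
    planning = [h for h, flag in zip(token_entropies, is_planning) if flag]
    execution = [h for h, flag in zip(token_entropies, is_planning) if not flag]

    def mean(values: List[float]) -> float:
        # Empty groups report 0.0 rather than raising.
        return sum(values) / len(values) if values else 0.0

    return mean(planning), mean(execution)

# Toy values, for illustration only.
plan_h, exec_h = split_mean_entropy([1.2, 0.1, 0.9, 0.2], [True, False, True, False])
```

Logging both numbers per training step would show whether planning entropy is rising while execution entropy stabilizes, or whether both are collapsing together.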

Sources 8 notes

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about RL-induced strategy collapse in LLM search agents. The question remains open: can entropy regularization or critique models prevent this collapse?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, tracking RL's effect on search diversity:
• Policy entropy crashes toward zero during RL training; Clip-Cov, KL-Cov, and GPPO preserve entropy and push back performance ceilings, but RL updates only 5–30% of parameters, and those updates are nearly identical across seeds, suggesting a structural pull toward collapse (~2025).
• RL post-training amplifies a single dominant pretraining format within the first epoch; the winner is often determined by model scale, not performance (~2025).
• Numerical rewards are informationally thin; chain-of-thought critiques and tree-search critics (ranking solution *paths*) break plateaus by supplying dense, process-level signals (~2024–2025).
• SFT on diverse demonstrations preserves exploration breadth that RL later erodes; preventing collapse may depend partly on pre-RL banking of diversity (~2024–2025).
• Two-phase RL dynamic: early training consolidates execution correctness; later bottleneck requires *rising* planning-token entropy for strategic exploration (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (The Entropy Mechanism, May 2025)
• arXiv:2504.07912 (Echo Chamber, Apr 2025)
• arXiv:2506.03106 (Critique-GRPO, Jun 2025)
• arXiv:2605.22817 (Vector Policy Optimization, May 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, check whether newer models (post-June 2026), training methods (e.g., mixture-of-experts RL, emergent multi-strategy routing), or richer critiques (vision, causal graphs, outcome rollouts) have relaxed or overturned the collapse tendency. Distinguish the durable question (does collapse happen without intervention?) from perishable limits (do current regularizers fully solve it?). Cite what resolved each; flag where constraints still hold.
(2) Surface the strongest work from the last ~6 months that contradicts or supersedes the library's claims — especially any showing collapse can be *beneficial* or *unavoidable by design*.
(3) Propose 2 research questions assuming the RL regime may have shifted: (a) Can mixture-weighted policy branches (rather than single collapsed mode) achieve both diversity and performance? (b) Do process-level critiques trained on *why strategies fail* prevent collapse better than reward-signal variation alone?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
