Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

The intuitive framing of critique models is that they help at test time: the model generates, the critic scores, we select the best. But the more important finding from AutoMathCritique is that critique integrated into the training loop improves the actor model's exploration efficiency and solution diversity during training itself.

Without critique in the loop, iterative self-training suffers from "tail narrowing" — the model converges on a narrow distribution of solutions, becoming less able to explore diverse reasoning paths. The critique model counteracts this: by providing step-level feedback on exploration, it guides the actor toward high-quality paths it wouldn't have discovered alone, maintaining distributional breadth through training.
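One way to picture critique in the loop is as an extra step inside the data-collection phase of each self-training round. The sketch below is a minimal illustration under stated assumptions, not AutoMathCritique's actual procedure: the function name `collect_with_critique` and the four callables (`sample`, `critique`, `refine`, `is_correct`) are hypothetical placeholders for the actor, the critic, and an answer checker.

```python
from typing import Callable, Iterable, List, Tuple


def collect_with_critique(
    problems: Iterable[str],
    sample: Callable[[str], str],              # hypothetical: actor draws one solution
    critique: Callable[[str, str], str],       # hypothetical: critic returns step-level feedback
    refine: Callable[[str, str, str], str],    # hypothetical: actor revises using the feedback
    is_correct: Callable[[str, str], bool],    # hypothetical: final-answer checker
    samples_per_problem: int = 4,
) -> List[Tuple[str, str]]:
    """Gather (problem, solution) pairs for the next fine-tuning round."""
    kept: List[Tuple[str, str]] = []
    for problem in problems:
        for _ in range(samples_per_problem):
            solution = sample(problem)
            # Failed attempts get one critique-guided revision instead of
            # being discarded outright.
            if not is_correct(problem, solution):
                feedback = critique(problem, solution)
                solution = refine(problem, solution, feedback)
            # Only verified solutions enter the training set.
            if is_correct(problem, solution):
                kept.append((problem, solution))
    return kept
```

In this sketch, problems the actor cannot yet solve unaided can still contribute training examples, which is one plausible reading of how critique keeps harder, rarer solution paths in the data rather than letting each round train only on what the model already does well.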

This connects to the note "Does policy entropy collapse limit reasoning performance in RL?": critique models are a way to maintain entropy, the exploration capacity needed for continued improvement, without relying solely on algorithmic entropy controls such as Clip-Cov and KL-Cov. The critique is an external signal that prevents premature convergence.
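To make "distributional breadth" concrete, one simple proxy is the Shannon entropy of the empirical distribution over solution strategies in a batch of samples. This is an illustrative assumption, not the token-level policy entropy that RL papers usually track; the function name `strategy_entropy` and the strategy labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def strategy_entropy(strategy_labels: Iterable[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    counts = Counter(strategy_labels)
    total = sum(counts.values())
    # H = -sum_i p_i * ln(p_i), with p_i the observed frequency of label i.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical batches: each label names the strategy used by one sampled solution.
broad = ["algebra", "geometry", "induction", "casework"]
narrow = ["algebra", "algebra", "algebra", "algebra"]
```

Here `strategy_entropy(broad)` equals ln 4 (about 1.386) because the four strategies are equally represented, while `strategy_entropy(narrow)` is zero. Tracking a measure like this across self-training rounds would show whether the solution distribution is narrowing.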

The implication: critique models are training infrastructure as much as inference infrastructure. Evaluating them only on test-time accuracy misses their more fundamental role.

Related concepts in this collection 5

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
critique models as a mechanism against entropy collapse
Can natural language feedback overcome numerical reward plateaus? Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
concrete evidence: Critique-GRPO shows that CoT critiques break plateaus where 8x scaling of numerical rewards fails; the NLF mechanism works precisely because critiques expand the effective exploration space that numerical rewards cannot reach
Can diversity optimization improve quality during language model training? Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
DARLING provides the complementary mechanism: critique models maintain diversity by guiding exploration quality, while explicit semantic diversity optimization maintains diversity by directly rewarding distributional breadth — together they address the entropy collapse problem from both the feedback channel (critique) and the reward signal (diversity bonus)
Can a single problem unlock reasoning through solution critique? Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.
extends with extreme efficiency: CFT shows that diverse critiques on a *single* problem suffice for reasoning activation — the diversity-via-critique mechanism does not need a diverse problem distribution, only diverse critiques of the solution space; this is the strongest evidence for the "critique is training infrastructure" framing
Does critiquing errors teach deeper understanding than imitating correct answers? Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
extends to the training data design: training models on critiques of noisy responses produces deeper understanding than training on correct responses; the principle generalizes from "critique guides exploration" to "critique IS the training signal"

Original note title

critique models improve exploration diversity during training not just test-time accuracy
