Do critique models improve diversity during training itself?
Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
The intuitive framing of critique models is that they help at test time: the model generates, the critic scores, we select the best. But the more important finding from AutoMathCritique is that critique integrated into the training loop improves the actor model's exploration efficiency and solution diversity during training itself.
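The test-time use described above (generate, score, select) can be sketched as best-of-N selection. This is a minimal illustration, not AutoMathCritique's implementation; `generate` and `critic_score` are hypothetical stand-ins for an actor model and a critique model.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # hypothetical actor: prompt -> candidate solution
    critic_score: Callable[[str, str], float],  # hypothetical critic: (prompt, solution) -> score
    n: int = 4,
) -> str:
    """Sample n candidates and return the one the critic scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: critic_score(prompt, c))
```

In this framing the critic only filters outputs that the actor already produces. It does not change what the actor learns to explore, which is the gap the training-time use addresses.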
Without critique in the loop, iterative self-training suffers from "tail narrowing" — the model converges on a narrow distribution of solutions, becoming less able to explore diverse reasoning paths. The critique model counteracts this: by providing step-level feedback on exploration, it guides the actor toward high-quality paths it wouldn't have discovered alone, maintaining distributional breadth through training.
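One rough way to watch for tail narrowing is to sample several solutions per problem at each self-training iteration and track how spread out they are. The sketch below is an assumption about how such a diagnostic could look, not a metric taken from the paper. It assumes each sampled solution has already been reduced to a hashable key (for example a final answer or a strategy label), and the names `solution_entropy` and `distinct_ratio` are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def solution_entropy(solutions: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over sampled solutions."""
    counts = Counter(solutions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_ratio(solutions: Iterable[Hashable]) -> float:
    """Fraction of sampled solutions that are unique (1.0 means all differ)."""
    items = list(solutions)
    return len(set(items)) / len(items) if items else 0.0
```

If both quantities fall steadily across iterations while accuracy stalls, that pattern would be consistent with the narrowing described above. A run with critique in the loop would be expected to hold them higher.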
This connects to the note "Does policy entropy collapse limit reasoning performance in RL?": critique models are a way to maintain entropy, and with it the exploration needed for continued improvement, without relying solely on algorithmic entropy management (Clip-Cov, KL-Cov). The critique is an external signal that counteracts premature convergence.
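For reference, the policy entropy at stake here is the entropy of the actor's next-token distribution, averaged over generated tokens. The sketch below computes it from raw logits in plain Python. It is only a diagnostic under the assumption that per-step logits are available as lists of floats; it is not an implementation of Clip-Cov or KL-Cov, and the function names are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Entropy (in nats) of the softmax distribution defined by one step's logits."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_logits: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over one generated sequence."""
    return sum(token_entropy(step) for step in step_logits) / len(step_logits)
```

A value near zero means the policy is almost deterministic at each step. Entropy collapse shows up as this average trending toward zero over the course of training.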
The implication: critique models are training infrastructure as much as inference infrastructure. Evaluating them only on test-time accuracy misses their more fundamental role.
Inquiring lines that use this note as a source (76)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can unified policies handle negative feedback and critique transformation simultaneously?
- Why does embedding evaluation criteria in prompts reduce creative scope?
- Can few-shot examples narrow generative diversity in creative tasks?
- How do intrinsic motivation principles explain why generating novel challenges improves learning?
- How does critique fine-tuning on one problem unlock broader reasoning?
- How do critique models prevent policy entropy collapse during reasoning training?
- Can prompting for specific creative paradigms improve ideation diversity?
- Does training on critiques of noisy responses produce deeper understanding than imitating correct ones?
- Can diverse critiques on a single problem unlock reasoning without diverse problem sets?
- Can population diversity in self-improvement prevent error avalanching failures?
- What makes external diversity more effective than sequential revision steps?
- Why does AI output show diversity without multiplying actual points of view?
- Can diverse expert demonstrations exceed the knowledge of any single expert?
- How does forced exploration through diversity rewards differ from suppression-based negative reinforcement?
- Why do research ideation systems suffer from diversity collapse despite high novelty metrics?
- How do you verify whether your context distribution satisfies covariate diversity?
- Can lower embedding dimensions alone solve the diversity problem without attention mechanisms?
- How do ensemble methods apply within a single model?
- Can diverse human creativity survive if all AI systems converge on similar outputs?
- What happens to idea diversity when AI tools draw from collective knowledge?
- What conditions make training diversity better than individual expert quality?
- How does mutual shaping through diverse training compare to population-level diversity effects?
- How does process-focused feedback compare to outcome-focused feedback in skill training?
- How do contrasting examples improve AI feedback quality over generic suggestions?
- Can co-evolved critics truly circumvent static evaluator limitations in self-improvement?
- Can structural diversity through role assignment replace emergent diversity in small models?
- Why does evaluating multiple candidates work better than judging one answer?
- How do semantic reward shaping approaches compare to full critique models?
- Why does positive reinforcement degrade diversity at higher k values?
- How does majority voting fail when reasoning samples lack genuine diversity?
- Can explicit rejection responses solve the over-specialization failure mode?
- Can negative feedback through critiques achieve the same steering flexibility as positive preferences?
- What creates the irreducible trade-off between quality and diversity in training data?
- Can debate between multiple models prevent the failures of single-model self-revision?
- How should training incorporate external critique versus encouraging self-correction?
- Does self-generated training data reduce a model's capability diversity?
- Why does critique training produce deeper understanding than imitation training?
- How do quality, diversity, and complexity create different effects on downstream model performance?
- How does diversity collapse during iterative self-improvement cycles?
- Why does external critique improve revision accuracy more than self-assessment?
- Does meta-judging improve evaluator quality better than temporal decoupling alone?
- How can AI improve the peer review bottleneck without replacing reviewers?
- Can structured decomposition fix evaluation gaps in other research tasks?
- Can runtime interventions like meta-cognitive prompting work where training interventions fail?
- How does symbolic solver feedback differ from language-based self-critique?
- Can self-training drift be prevented by applying student compatibility filtering?
- Can shifting the accuracy metric itself eliminate the need for diversity post-processing?
- How can semantic diversity optimization work if exploration and exploitation were truly opposed?
- Why does external critique improve revision while internal self-assessment fails?
- How does diversity collapse during iterative self-improvement affect solution quality?
- Why do models trained on critique fail at self-critique despite strong other-model evaluation?
- Does critique training improve exploration diversity during model training or only test time?
- How does directional diversity compare to other forms of parallel planning?
- Can LLM diversity collapse in research ideation be reversed or mitigated?
- Can explicitly optimizing for semantic diversity during RL training improve both quality and variation?
- Why does preference tuning reduce diversity in code but increase it in creative tasks?
- What happens to model grounding when preference optimization increases effective diversity?
- Why does enlarging the evaluation unit reintroduce comparability problems?
- Can external retrieval signals outperform internal self-assessment during revision?
- What makes creative writing diversity different from code diversity fundamentally?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- Should test-time search maximize diversity of competent solutions instead of converging on one strategy?
- Does the productive difficulty band ever stabilize during training?
- What makes self-consistency a sufficient training target for the judge role?
- Why does strengthening the judge improve the actor's generation performance?
- Why do preference-tuned models produce different diversity patterns in code versus creative writing?
- Does external critique guide revision better than internal self-assessment during model training?
- Why does outcome-based RL specifically lose diversity during training?
- Does semantic diversity in output space compete with reward-component diversity?
- How much does diversity training cost in single-shot pass@1 performance?
- Which aggregation method best exploits diversity in generated solutions?
- Does verbalized sampling preserve factual accuracy and safety during diversity gains?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- What makes multi-turn critique trajectories more effective than single-turn reasoning chains?
- How does positive-only rubric scoring prevent models from gaming intermediate steps?
- How do complexity and diversity affect model performance differently?
Related concepts in this collection (5)
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  critique models as a mechanism against entropy collapse
- Can natural language feedback overcome numerical reward plateaus?
  Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
  concrete evidence: Critique-GRPO shows that CoT critiques break plateaus where 8x scaling of numerical rewards fails; the NLF mechanism works precisely because critiques expand the effective exploration space that numerical rewards cannot reach
- Can diversity optimization improve quality during language model training?
  Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
  DARLING provides the complementary mechanism: critique models maintain diversity by guiding exploration quality, while explicit semantic diversity optimization maintains diversity by directly rewarding distributional breadth; together they address the entropy collapse problem from both the feedback channel (critique) and the reward signal (diversity bonus)
- Can a single problem unlock reasoning through solution critique?
  Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.
  extends with extreme efficiency: CFT shows that diverse critiques on a *single* problem suffice for reasoning activation; the diversity-via-critique mechanism does not need a diverse problem distribution, only diverse critiques of the solution space; this is the strongest evidence for the "critique is training infrastructure" framing
- Does critiquing errors teach deeper understanding than imitating correct answers?
  Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
  extends to the training data design: training models on critiques of noisy responses produces deeper understanding than training on correct responses; the principle generalizes from "critique guides exploration" to "critique IS the training signal"
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- Outcome-based Exploration for LLM Reasoning
- Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Jointly Reinforcing Diversity and Quality in Language Model Generations
Original note title
critique models improve exploration diversity during training not just test-time accuracy