SYNTHESIS NOTE

Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

The intuitive framing of critique models is that they help at test time: the model generates, the critic scores, we select the best. But the more important finding from AutoMathCritique is that critique integrated into the training loop improves the actor model's exploration efficiency and solution diversity during training itself.
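The test-time framing above (generate, score, select) can be sketched as best-of-N selection. This is a minimal illustration, not AutoMathCritique's implementation: `best_of_n` and `toy_critic` are hypothetical names, and the toy critic is a stand-in for a learned critique model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], critic_score: Callable[[str], float]) -> str:
    """Return the candidate the critic scores highest (ties go to the earliest)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=critic_score)


def toy_critic(solution: str) -> float:
    """Hypothetical stand-in critic: rewards solutions that state a final answer."""
    return 1.0 if "answer:" in solution else 0.0


# Three sampled solutions for one problem; the critic only ranks them.
samples = ["x = 4", "2 + 2 = 4, answer: 4", "answer: 5"]
chosen = best_of_n(samples, toy_critic)
```

In this usage the critic never changes what the actor generates. It only picks among finished samples, which is the limitation the rest of this note is concerned with.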

Without critique in the loop, iterative self-training suffers from "tail narrowing": the model converges on a narrow distribution of solutions and becomes less able to explore diverse reasoning paths. The critique model counteracts this. By providing step-level feedback on exploration, it guides the actor toward high-quality paths it would not have discovered alone, maintaining distributional breadth through training.
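One way to picture critique in the loop is a data-collection round in which the actor may revise an attempt using critic feedback before the attempt is filtered for the next training set. The sketch below is an assumption-laden toy, not the method from the source: `collect_round` and every `toy_*` function are hypothetical, and the toy critic simply reuses the answer checker, which a real critique model could not do.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def collect_round(
    problems: Sequence[str],
    sample: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    refine: Callable[[str, str, str], str],
    is_correct: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """One round: sample, critique, optionally refine, keep only correct solutions."""
    kept = []
    for problem in problems:
        attempt = sample(problem)
        feedback = critique(problem, attempt)  # None means no flaw was flagged
        if feedback is not None:
            attempt = refine(problem, attempt, feedback)
        if is_correct(problem, attempt):
            kept.append((problem, attempt))
    return kept


# Toy stand-ins (all hypothetical): problems are "a+b" strings.
def toy_is_correct(problem: str, attempt: str) -> bool:
    left, right = problem.split("+")
    return attempt == str(int(left) + int(right))


def toy_sample(problem: str) -> str:
    return "2"  # a weak actor that always guesses "2"


def toy_critique(problem: str, attempt: str) -> Optional[str]:
    return None if toy_is_correct(problem, attempt) else "the sum is wrong"


def no_critique(problem: str, attempt: str) -> Optional[str]:
    return None  # baseline: the critic never intervenes


def toy_refine(problem: str, attempt: str, feedback: str) -> str:
    left, right = problem.split("+")
    return str(int(left) + int(right))


problems = ["1+1", "2+3"]
without_critic = collect_round(problems, toy_sample, no_critique, toy_refine, toy_is_correct)
with_critic = collect_round(problems, toy_sample, toy_critique, toy_refine, toy_is_correct)
```

In this toy, the baseline round keeps a solution only for the problem the actor already solves, while the critique round also recovers the second problem. That is one plausible reading of how critique keeps harder problems represented in the training data.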

This connects to the note "Does policy entropy collapse limit reasoning performance in RL?": critique models are a way to maintain entropy, the exploration needed for continued improvement, without relying solely on entropy management built into the policy update (Clip-Cov, KL-Cov). The critique is an external signal that prevents premature convergence.
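A rough way to check whether breadth is being maintained is to track the Shannon entropy of the distinct solution paths sampled for a problem across training rounds. The sketch below assumes solutions have already been mapped to hashable path labels, which is itself a nontrivial step. It measures sample-level diversity, not the token-level policy entropy that Clip-Cov and KL-Cov act on, and `empirical_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def empirical_entropy(samples: Iterable[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sample")
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Collapsed actor: every sample follows the same reasoning path.
narrow = ["path_a", "path_a", "path_a", "path_a"]
# Diverse actor: four samples, four distinct paths.
broad = ["path_a", "path_b", "path_c", "path_d"]

narrow_entropy = empirical_entropy(narrow)
broad_entropy = empirical_entropy(broad)
```

For H = -Σ p_i log p_i, a fully collapsed sample set gives zero and a uniform spread over k distinct paths gives log k, so a value drifting toward zero across rounds would be consistent with tail narrowing.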

The implication: critique models are training infrastructure as much as inference infrastructure. Evaluating them only on test-time accuracy misses their more fundamental role.

Original note title

critique models improve exploration diversity during training not just test-time accuracy