Can diverse critiques on a single problem unlock reasoning without diverse problem sets?
This explores whether the *variety of critiques* on one problem — seeing many right and wrong solution attempts judged — can switch on a model's latent reasoning, instead of the usual recipe of training across a large, diverse problem set.
The corpus suggests the answer is a surprisingly strong yes: diversity of *critique* (many judged solutions to one problem) can substitute for diversity of *problems*, because the activation signal lives in the contrast between good and bad reasoning, not in the breadth of tasks. The most direct evidence is Critique Fine-Tuning, which reaches reasoning activation comparable to reinforcement learning with verifiable rewards (RLVR) using a single problem and teacher-written critiques of varied solutions, with no RL at all (see: Can a single problem unlock reasoning through solution critique?). The takeaway is that showing a model *why this solution is right and that one is wrong* on one specific problem is a sufficient trigger: the diversity that matters is across attempts, not across questions.
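To make the data shape concrete, here is a minimal sketch of how one problem plus several judged attempts could be turned into critique-style training pairs. The `Candidate` class, the `build_cft_examples` helper, the prompt wording, and the toy arithmetic problem are all assumptions for illustration; they are not taken from the Critique Fine-Tuning work itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One attempted solution to the seed problem, with a teacher judgment (hypothetical schema)."""
    solution: str
    critique: str
    is_correct: bool


def build_cft_examples(problem: str, candidates: list[Candidate]) -> list[dict[str, str]]:
    """Turn a single problem and many judged attempts into (prompt, target) pairs.

    The model is trained to produce the critique, so variety comes from the
    attempts being judged rather than from the number of problems.
    """
    examples = []
    for cand in candidates:
        prompt = (
            f"Problem:\n{problem}\n\n"
            f"Candidate solution:\n{cand.solution}\n\n"
            "Critique this solution and state whether it is correct."
        )
        verdict = "correct" if cand.is_correct else "incorrect"
        target = f"{cand.critique}\nVerdict: {verdict}"
        examples.append({"prompt": prompt, "target": target})
    return examples


# Toy data: one problem, three attempts (two right by different routes, one wrong).
problem = "What is 17 * 6?"
candidates = [
    Candidate("17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102",
              "The decomposition and both partial products are right.", True),
    Candidate("17 * 6 = 17 * 5 + 6 = 85 + 6 = 91",
              "The last step adds 6 instead of another 17.", False),
    Candidate("17 * 6 = 20 * 6 - 3 * 6 = 120 - 18 = 102",
              "A valid alternative route via rounding up.", True),
]
examples = build_cft_examples(problem, candidates)
```

Every example shares the same problem text; only the attempt and its critique change, which is the sense in which the diversity is across attempts rather than across questions.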
That finding sits next to a striking parallel from a different method: a single training example in RLVR can lift math performance from 36% to 73.6%, and the model keeps generalizing for over a thousand steps after it has perfectly memorized that one example (see: Can a single training example unlock mathematical reasoning?). Read together, these two papers tell the same story from opposite directions: reasoning capability is largely *latent* in the base model, and what minimal training does is unlock it rather than teach it. If the skill is already there, you don't need a thousand problems to surface it; you need the right activation signal.
Why would critiques specifically be that signal? Because critique seems to act on *diversity of exploration* rather than just accuracy. Step-level critique inside the training loop counteracts 'tail narrowing' and keeps a model's solution distribution wide across self-training iterations. That is a more fundamental benefit than a test-score bump, since it prevents the premature convergence that kills reasoning (see: Do critique models improve diversity during training itself?). So critique isn't just grading; it's the thing that stops a model from collapsing onto one rigid strategy. That reframes the whole question: maybe the value of a diverse problem set was always really about forcing diverse exploration, and critique manufactures that diversity on a single problem directly.
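One way to picture a narrowing solution distribution is to track the entropy of solution strategies sampled at each self-training iteration. The sketch below is an illustrative assumption, not the metric used in the cited note: the strategy labels, the `strategy_entropy` helper, and the two sample sets are all hypothetical.

```python
import math
from collections import Counter


def strategy_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over strategy labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical strategy labels for 8 sampled solutions to the same problem.
# Early iteration: four strategies, evenly used.
iter_0 = ["algebra", "casework", "induction", "geometry"] * 2
# Later iteration without critique: almost everything has collapsed onto one strategy.
iter_3 = ["algebra"] * 7 + ["casework"]

wide = strategy_entropy(iter_0)
narrow = strategy_entropy(iter_3)
```

Under this toy measure, a falling entropy across iterations would be the signature of collapse onto one strategy, and a critique step that helps would show up as entropy staying high.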
The corpus also hints at *where* the diversity has to come from, because narrowing is the recurring enemy. Models reasoning in plain monologue get stuck in fixed strategies and fragmented attention; restructuring a single model's thinking as a dialogue between distinct internal agents recovers diversity and coherence on exactly the tasks that need multiple approaches (see: Can dialogue format help models reason more diversely?). Similarly, at large budgets, spending test-time compute on diverse *abstractions* of a problem beats sampling more parallel solutions of the same shape: structured breadth beats both unstructured sampling and depth-only chains (see: Can abstractions guide exploration better than depth alone?). Diverse critiques are another mechanism in this same family: a way of injecting breadth into how a single problem is approached.
The thing you might not have expected to learn: this whole line of work quietly relocates the bottleneck. If one problem plus varied critiques can activate reasoning, then the failures we see in big models may not be failures of *training breadth* at all. Other notes argue that reasoning collapses are really execution-bandwidth limits, where models know the algorithm but can't run it step by step in text (see: Are reasoning model collapses really failures of reasoning?), or structural disorganization, where viable solution paths exist but get abandoned prematurely and can be rescued by simple decoding-level penalties with no retraining (see: Why do reasoning models abandon promising solution paths?). The convergent message across these is that reasoning is mostly *present and under-activated*, and diverse critique is one of the cheapest known keys to switch it on.
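As a rough illustration of what a decoding-level penalty can look like, the sketch below lowers the scores of tokens that signal a change of approach while the current line of thought is still young. Everything here is an assumption for illustration: the `penalize_switch_tokens` helper, the token strings, and the `penalty` and `window` values are hypothetical and are not the cited work's implementation.

```python
def penalize_switch_tokens(
    logits: dict[str, float],
    switch_tokens: set[str],
    tokens_since_thought_start: int,
    penalty: float = 3.0,
    window: int = 50,
) -> dict[str, float]:
    """Return a copy of the logits with switch tokens penalized early in a thought.

    Inside the first `window` tokens of the current thought, tokens such as
    "Alternatively" have `penalty` subtracted, which discourages abandoning
    the path prematurely. After the window, the logits are left unchanged.
    """
    if tokens_since_thought_start >= window:
        return dict(logits)
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


# Toy next-token scores where the model is inclined to switch approach.
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch_tokens = {"Alternatively"}

early = penalize_switch_tokens(logits, switch_tokens, tokens_since_thought_start=5)
late = penalize_switch_tokens(logits, switch_tokens, tokens_since_thought_start=80)

early_choice = max(early, key=early.get)  # greedy pick while the thought is young
late_choice = max(late, key=late.get)     # greedy pick once the window has passed
```

The point of the design is that nothing is retrained: the model's preferences are unchanged, and only the decoding step is nudged to stay on a path a little longer before switching.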
Sources (7 notes)
Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is a sufficient activation signal.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.