Can diverse critiques on a single problem unlock reasoning without diverse problem sets?

This explores whether the *variety of critiques* on one problem — seeing many right and wrong solution attempts judged — can switch on a model's latent reasoning, instead of the usual recipe of training across a large, diverse problem set.

The question is whether diversity of *critique* (many judged solutions to one problem) can substitute for diversity of *problems*, and the corpus suggests the answer is a surprisingly strong yes: the activation signal lives in the contrast between good and bad reasoning, not in the breadth of tasks. The most direct evidence is Critique Fine-Tuning, which reaches reasoning activation comparable to reinforcement learning with verifiable rewards (RLVR) using a single problem and teacher-written critiques of varied solutions, with no RL at all (see "Can a single problem unlock reasoning through solution critique?"). The takeaway is that exposing a model to *why this solution is right and that one is wrong* on one specific problem is a sufficient trigger; the diversity that matters is across attempts, not across questions.
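To make the single-problem setup concrete, here is a minimal sketch of how such a training set might be assembled. It is an illustration under stated assumptions, not the Critique Fine-Tuning authors' pipeline: the `Attempt` class, the `build_critique_examples` helper, the prompt wording, and the toy problem are all hypothetical. The only idea taken from the text above is that every example shares one problem and the training target is the critique of a candidate solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One candidate solution to the single seed problem (hypothetical type)."""
    solution: str
    critique: str      # teacher-written judgment of this attempt
    is_correct: bool   # teacher's verdict, kept only for bookkeeping


def build_critique_examples(problem: str, attempts: list[Attempt]) -> list[dict]:
    """Turn one problem plus many judged attempts into fine-tuning pairs.

    The model input is the problem together with a candidate solution;
    the training target is the critique, not the solution itself.
    """
    examples = []
    for attempt in attempts:
        prompt = (
            f"Problem:\n{problem}\n\n"
            f"Candidate solution:\n{attempt.solution}\n\n"
            "Critique this solution."
        )
        examples.append({"prompt": prompt, "target": attempt.critique})
    return examples


# Toy data: one problem, three attempts of mixed quality (all invented).
problem = "What is the sum of the first 10 positive integers?"
attempts = [
    Attempt("10 * 11 / 2 = 55", "Correct: applies n(n+1)/2 with n = 10.", True),
    Attempt("10 * 10 / 2 = 50", "Incorrect: uses n*n/2 instead of n(n+1)/2.", False),
    Attempt("Adding 1 through 9 gives 45", "Incorrect: the sum omits 10.", False),
]
examples = build_critique_examples(problem, attempts)
```

Note that the variety in `examples` comes entirely from the attempts and their critiques; the problem string is identical in every prompt.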

That finding sits next to a striking parallel from a different method: a single training example in RLVR can lift math performance from 36% to 73.6%, and test accuracy keeps improving for roughly 1,400 steps after the model has perfectly memorized that one example (see "Can a single training example unlock mathematical reasoning?"). Read together, these two papers tell the same story from opposite directions: reasoning capability is largely *latent* in the base model, and what minimal training does is unlock it rather than teach it. If the skill is already there, you don't need a thousand problems to surface it; you need the right activation signal.

Why would critiques specifically be that signal? Because critique seems to act on *diversity of exploration* rather than just accuracy. Step-level critique inside the training loop counteracts 'tail narrowing' and keeps a model's solution distribution wide across self-training iterations. That is a more fundamental benefit than a test-score bump, since it prevents the premature convergence that kills reasoning (see "Do critique models improve diversity during training itself?"). So critique isn't just grading; it's the thing that stops a model from collapsing onto one rigid strategy. That reframes the whole question: maybe the value of a diverse problem set was always really about forcing diverse exploration, and critique manufactures that diversity on a single problem directly.
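One way to picture a solution distribution staying wide or narrowing is to track the entropy of solution strategies across self-training iterations. The sketch below is an assumption-laden illustration: Shannon entropy over strategy labels is a proxy chosen here, not the metric used in the cited work, and the labels and iteration data are invented.

```python
import math
from collections import Counter


def strategy_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical strategy distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical strategy labels for 8 sampled solutions per iteration.
iteration_0 = ["algebra", "algebra", "induction", "casework",
               "geometry", "casework", "induction", "geometry"]
iteration_3 = ["algebra"] * 7 + ["casework"]

wide = strategy_entropy(iteration_0)    # four strategies, evenly used
narrow = strategy_entropy(iteration_3)  # one strategy dominates
```

In this toy data the entropy drops from `wide` to `narrow` as one strategy takes over, which is the kind of collapse a critique signal in the loop would be expected to resist.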

The corpus also hints at *where* the diversity has to come from, because narrowing is the recurring enemy. Models reasoning in plain monologue get stuck in fixed strategies and fragmented attention; restructuring a single model's thinking as a dialogue between distinct internal agents recovers diversity and coherence on exactly the tasks that need multiple approaches (see "Can dialogue format help models reason more diversely?"). Similarly, spending test-time compute on diverse *abstractions* of a problem beats sampling more parallel solutions of the same shape at large budgets: structured breadth beats brute depth (see "Can abstractions guide exploration better than depth alone?"). Diverse critiques are another mechanism in this same family: a way of injecting breadth into how a single problem is approached.

The thing you might not have expected to learn: this whole line of work quietly relocates the bottleneck. If one problem plus varied critiques can activate reasoning, then the failures we see in big models may not be failures of *training breadth* at all. Other notes argue that reasoning collapses are really execution-bandwidth limits, where models know the algorithm but can't run it step by step in text (see "Are reasoning model collapses really failures of reasoning?"), or structural disorganization, where viable solution paths exist but get abandoned prematurely and can be rescued by simple decoding-level penalties with no retraining (see "Why do reasoning models abandon promising solution paths?"). The convergent message across these is that reasoning is mostly *present and under-activated*, and diverse critique is one of the cheapest known keys to switch it on.
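The decoding-level penalties mentioned above can be sketched as a small adjustment to next-token logits. This is a minimal illustration assuming a penalty of this general shape; the function name, the `penalty` and `window` defaults, and the token ids are hypothetical and are not taken from the cited work.

```python
def penalize_switch_tokens(
    logits: list[float],
    switch_token_ids: set[int],
    tokens_since_thought_start: int,
    penalty: float = 3.0,
    window: int = 50,
) -> list[float]:
    """Lower the logits of thought-switch tokens early in a thought.

    Inside the first `window` tokens of the current thought, every token id in
    `switch_token_ids` has `penalty` subtracted from its logit. Outside that
    window the logits are returned unchanged.
    """
    if tokens_since_thought_start >= window:
        return list(logits)
    return [
        value - penalty if token_id in switch_token_ids else value
        for token_id, value in enumerate(logits)
    ]


# Toy vocabulary of 5 tokens; id 3 stands in for a word like "Alternatively".
logits = [1.0, 0.5, 2.0, 2.5, 0.0]
early = penalize_switch_tokens(logits, {3}, tokens_since_thought_start=10)
late = penalize_switch_tokens(logits, {3}, tokens_since_thought_start=80)
```

Early in a thought the switch token loses its lead, so greedy decoding would continue the current path; once the window has passed, the model is free to switch. Nothing about the model's weights changes, which is the sense in which no retraining is needed.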

Sources (7 notes)

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can dialogue format help models reason more diversely?

DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher re-testing whether diverse critiques on a single problem can unlock reasoning without diverse problem sets. This question remains open—treat the findings below as dated claims to be re-examined, not current truth.

What a curated library found — and when (findings span Nov 2024–Nov 2025):
• Critique Fine-Tuning on one problem reaches reasoning activation comparable to RL-from-verifiable-rewards, with no RL needed (2025-06).
• A single training example in RLVR lifts math performance from 36% to 73.6%, generalizing for 1000+ steps post-memorization (2025-05).
• Step-level critique counteracts 'tail narrowing' and keeps solution distribution wide across self-training—preventing premature convergence (2024-11).
• Dialogue-based reasoning (structured internal agents) recovers diversity and coherence on single tasks better than monologue (2025-05).
• Reasoning performance collapse is often execution-bandwidth failure, not reasoning failure; models know the algorithm but cannot step through it in text (2025-01).

Anchor papers (verify; mind their dates):
• arXiv:2506.03295 (Critique Fine-Tuning, 2025-06)
• arXiv:2504.20571 (RLVR One Example, 2025-05)
• arXiv:2411.16579 (Critique Models, 2024-11)
• arXiv:2505.07049 (DialogueReason, 2025-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude-4), training methods (synthetic data, preference-learning at scale), inference harnesses (multi-step caching, persistent scratchpads), or evaluation benchmarks have since RELAXED or OVERTURNED it. Separate the durable question (can critique diversity substitute for problem diversity?) from perishable limits (e.g., does it hold beyond math?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months—especially any showing critique diversity *cannot* replace problem diversity, or that execution-bandwidth is a red herring.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., does critique diversity saturate? Can a single critique suffice? Do language-model agents need *different* critique modalities (symbolic, visual, multi-agent) to keep generalizing?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
