Can a single problem unlock reasoning through solution critique?
Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.
Critique Fine-Tuning (CFT) achieves reasoning activation comparable to RLVR using only a single problem. The method collects diverse model-generated solutions to one problem, then uses a teacher LLM to generate a detailed critique of each solution. Fine-tuning on the resulting (solution, critique) pairs, with no reinforcement learning at all, unlocks reasoning performance at a fraction of the computational cost; RLVR, by contrast, requires hundreds of GPU hours.
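The data-collection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_cft_dataset`, `CritiqueExample`, the prompt template, and the toy solver and teacher functions are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CritiqueExample:
    """One fine-tuning example: the model reads `prompt` and learns to emit `target`."""
    prompt: str
    target: str


def build_cft_dataset(
    problem: str,
    generators: List[Callable[[str], str]],
    teacher_critique: Callable[[str, str], str],
    samples_per_generator: int = 1,
) -> List[CritiqueExample]:
    """Collect distinct solutions to ONE problem and pair each with a teacher critique."""
    examples: List[CritiqueExample] = []
    seen = set()
    for generate in generators:
        for _ in range(samples_per_generator):
            solution = generate(problem)
            if solution in seen:
                continue  # keep only distinct solutions, since diversity is the point
            seen.add(solution)
            critique = teacher_critique(problem, solution)
            # Assumed prompt layout: the problem and candidate solution form the
            # input, and the critique is the training target.
            prompt = (
                f"Problem:\n{problem}\n\n"
                f"Candidate solution:\n{solution}\n\n"
                "Critique:"
            )
            examples.append(CritiqueExample(prompt=prompt, target=critique))
    return examples


# Toy stand-ins for solver models and the teacher LLM (hypothetical).
def solver_a(problem: str) -> str:
    return "2 + 2 = 4"


def solver_b(problem: str) -> str:
    return "2 + 2 = 5"


def toy_teacher(problem: str, solution: str) -> str:
    verdict = "correct" if solution.endswith("= 4") else "incorrect"
    return f"The solution is {verdict}."


dataset = build_cft_dataset(
    "What is 2 + 2?", [solver_a, solver_b, solver_a], toy_teacher
)
```

In a real setup the generators would be several different models (or one model sampled repeatedly), and the resulting examples would feed an ordinary supervised fine-tuning loop in which only the critique tokens are the target. The duplicate `solver_a` entry is there to show that repeated solutions are dropped.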
The key insight is that the diversity that matters for reasoning activation is solution diversity (many approaches to one problem) rather than problem diversity (one approach to many problems). By holding the problem constant and varying the solutions, CFT isolates the critique-and-evaluation signal as the activation mechanism.
Of the evidence bearing on "Do base models already contain hidden reasoning ability?", this is the most resource-efficient confirmation yet. The progression of evidence is striking: RL post-training (expensive), RLVR (cheaper), 1-shot RLVR (minimal data), and now CFT (minimal data and no RL). Each step strips away another component previously thought essential, revealing that the activation signal is remarkably simple.
CFT also extends "Can a single training example unlock mathematical reasoning?" in an important direction. 1-shot RLVR shows that one problem suffices when RL provides the training signal. CFT shows that one problem suffices when critique provides the signal instead. The common denominator is not RL, not critique, and not solution diversity per se; it is exposure to the distinction between correct and incorrect reasoning applied to a specific problem.
This connects to "Does RL post-training create reasoning or just deploy it?" by providing yet another non-RL method that achieves similar activation. If RL, steering vectors, decoding changes, and now critique fine-tuning all unlock the same latent reasoning, then the capability itself appears to be determined by pre-training, and the choice of elicitation method is incidental.
The relationship to "Does critiquing errors teach deeper understanding than imitating correct answers?" is direct: CFT operationalizes the principle that evaluating errors teaches more than imitating successes. But CFT goes further: it shows that evaluating errors on a single problem is sufficient, collapsing the data requirement to its theoretical minimum of one problem.
Inquiring lines that use this note as a source (13)
This note is a source for these synthesized inquiries.
- How does execution-guided critique differ from abstract action evaluation?
- How does critique fine-tuning on one problem unlock broader reasoning?
- Can diverse critiques on a single problem unlock reasoning without diverse problem sets?
- Can diverse expert demonstrations exceed the knowledge of any single expert?
- Why does evaluating multiple candidates work better than judging one answer?
- Why does critique training produce deeper understanding than imitation training?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- How can one training example improve reasoning across thousands of unseen problems?
- Does critique training improve exploration diversity during model training or only test time?
- What is the distinction between teaching reasoning how versus when to activate?
- Why does decomposition ability transfer across domains but solving ability does not?
- Why do students learn better from explanations than from solving problems from scratch?
- How does the Learning Law explain why all examples should contribute equally?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Eliciting Reasoning in Language Models with Cognitive Tools
Original note title
critique fine-tuning on a single problem unlocks reasoning by exposing models to diverse solution critiques rather than diverse problems