Can a single problem unlock reasoning through solution critique?

Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.

Synthesis note · 2026-04-18 · sourced from Reasoning Architectures

Critique Fine-Tuning (CFT) achieves reasoning activation comparable to reinforcement learning with verifiable rewards (RLVR) using only a single problem. The method collects diverse model-generated solutions to one problem, then uses a teacher LLM to generate a detailed critique of each solution. Training on these solution-critique pairs, with no reinforcement learning involved, unlocks reasoning performance at a fraction of the computational cost; RLVR, by comparison, requires hundreds of GPU hours.
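The data-construction step described above can be sketched roughly as follows. This is a minimal illustration, not the method's actual implementation: `build_cft_dataset`, `CritiqueExample`, the prompt template, and the toy solver and teacher functions are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CritiqueExample:
    """One supervised pair: (problem + candidate solution) -> teacher critique."""
    prompt: str
    target: str


def build_cft_dataset(
    problem: str,
    solvers: List[Callable[[str], str]],
    teacher: Callable[[str, str], str],
    samples_per_solver: int = 1,
) -> List[CritiqueExample]:
    """Collect distinct candidate solutions to ONE problem and critique each."""
    examples: List[CritiqueExample] = []
    seen = set()
    for solve in solvers:
        for _ in range(samples_per_solver):
            solution = solve(problem)
            if solution in seen:  # skip exact duplicates to keep solutions diverse
                continue
            seen.add(solution)
            # Hypothetical prompt template; the real format is not given in the note.
            prompt = (
                f"Problem:\n{problem}\n\n"
                f"Candidate solution:\n{solution}\n\nCritique:"
            )
            examples.append(CritiqueExample(prompt, teacher(problem, solution)))
    return examples


# Toy stand-ins for the solver models and the teacher LLM (assumptions).
def solver_a(problem: str) -> str:
    return "2 + 2 * 3 = 4 * 3 = 12"


def solver_b(problem: str) -> str:
    return "2 + 2 * 3 = 2 + 6 = 8"


def toy_teacher(problem: str, solution: str) -> str:
    verdict = "correct" if solution.endswith("= 8") else "incorrect"
    return f"Conclusion: {verdict}."


dataset = build_cft_dataset("Compute 2 + 2 * 3.", [solver_a, solver_b], toy_teacher)
```

The resulting prompt and target pairs would then go to an ordinary supervised fine-tuning loop; no reward model or policy-gradient step appears anywhere in the sketch.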

The key insight is that the diversity that matters for reasoning activation is solution diversity (many approaches to one problem) rather than problem diversity (one approach to many problems). By holding the problem constant and varying the solutions, CFT isolates the critique-and-evaluation signal as the activation mechanism.

This is the most resource-efficient confirmation yet for the note "Do base models already contain hidden reasoning ability?" The progression of evidence is striking: RL post-training (expensive), RLVR (cheaper), 1-shot RLVR (minimal data), and now CFT (minimal data and no RL). Each step strips away another component previously thought essential, revealing that the activation signal is remarkably simple.

CFT also extends "Can a single training example unlock mathematical reasoning?" in an important direction. 1-shot RLVR shows that one problem suffices when RL provides the training signal. CFT shows that one problem suffices when critique provides the signal instead. The common denominator is not RL, not critique, and not solution diversity per se: it is exposure to the distinction between correct and incorrect reasoning applied to a specific problem.
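To make the contrast between the two training signals concrete, here is a small sketch under stated assumptions. The function names, the answer-extraction rule, and the example strings are hypothetical; neither function reflects how any specific RLVR or CFT codebase is written.

```python
def verifiable_reward(solution: str, reference_answer: str) -> float:
    """RLVR-style signal: a scalar obtained by checking only the final answer."""
    # Assumption: the final answer is whatever follows the last "=" sign.
    final_answer = solution.rsplit("=", 1)[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0


def critique_pair(problem: str, solution: str, critique: str) -> dict:
    """CFT-style signal: a supervised text target that explains the verdict."""
    return {
        "input": f"Problem: {problem}\nSolution: {solution}",
        "target": critique,
    }


problem = "Compute 2 + 2 * 3."
wrong = "2 + 2 * 3 = 4 * 3 = 12"
right = "2 + 2 * 3 = 2 + 6 = 8"

# Same problem, two candidate solutions: one scalar per solution.
rewards = [verifiable_reward(s, "8") for s in (wrong, right)]

# Same problem, same wrong solution: a text target instead of a scalar.
pair = critique_pair(
    problem,
    wrong,
    "The addition was performed before the multiplication. Conclusion: incorrect.",
)
```

Both signals separate the correct solution from the incorrect one on a single problem; they differ only in how that distinction reaches the model, as a reward to optimize or as text to imitate.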

This connects to "Does RL post-training create reasoning or just deploy it?" by providing yet another non-RL method that achieves similar activation. If RL, steering vectors, decoding changes, and now critique fine-tuning all unlock the same latent reasoning, the capability is most plausibly determined by pre-training, and the choice of elicitation method is largely incidental.

The relationship to "Does critiquing errors teach deeper understanding than imitating correct answers?" is direct: CFT operationalizes the principle that evaluating errors teaches more than imitating successes. But CFT goes further: it shows that evaluating errors on a single problem is sufficient, collapsing the number of training problems to its minimum of one.

Original note title

critique fine-tuning on a single problem unlocks reasoning by exposing models to diverse solution critiques rather than diverse problems
