How does preference learning differ from supervised finetuning for reasoning?
This explores why training a model to imitate correct answers (supervised finetuning) and training it against ranked or rewarded alternatives (preference learning) produce different reasoning behavior — not just different scores.
Supervised finetuning (SFT) and preference learning diverge specifically on *reasoning* quality, rather than on whether the final answer is right. The short version the corpus keeps circling: SFT teaches a model what answer to produce, while preference learning teaches it which way of getting there is better, and those turn out to be very different lessons.
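One way to make that contrast concrete is to compare training objectives. The block below is a sketch under assumptions, not something taken from the notes: it pairs the standard token-level likelihood loss for SFT with a Bradley-Terry pairwise loss as one common form of preference learning. The symbols (policy $\pi_\theta$, scoring function $r_\theta$, preferred and dispreferred traces $y^{+}$ and $y^{-}$) are notation introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)}\Big[\sum_{t=1}^{|y|} \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big)\Big]

\mathcal{L}_{\mathrm{pref}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\Big[\log \sigma\big(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\big)\Big]
```

The first loss sees one trace per prompt and raises its likelihood token by token. The second sees two traces for the same prompt and is minimized by widening the score gap between them, which is where the comparative signal comes from.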
The sharpest evidence is what SFT quietly breaks. One study finds that supervised finetuning raises benchmark accuracy while *cutting* Information Gain, its measure of reasoning-step quality, by 38.9%. The model learns to produce correct-looking answers through post-hoc rationalization rather than genuine inference, and standard metrics never catch it because they only check whether the final answer is correct (see "Does supervised fine-tuning improve reasoning or just answers?"). A related finding shows that fine-tuning on labeled examples teaches surface patterns rather than principled criteria: models fed labeled 'good arguments' learn what good arguments look like, not what makes them good, and fail to generalize to new argument types (see "Can models learn argument quality from labeled examples alone?"). Imitation copies the form of reasoning without the function.
Preference and reward-based learning attack this from the other side: instead of one gold trace to imitate, they compare traces against each other and reward the better one. RLAG rewards both answer accuracy *and* explanation rationality, internalizing coherent knowledge structures in a way that beats SFT precisely because it prioritizes reasoning quality over token-level correctness (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). You don't even need humans to do the ranking. In RLSF, the model's confidence in its own answer span ranks reasoning traces into synthetic preferences that strengthen step-by-step reasoning (see "Can model confidence work as a reward signal for reasoning?"). In SAMI, a model is aligned to written principles by maximizing the mutual information between principle and response, with no preference labels at all (see "Can models learn behavioral principles without preference labels?").
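The confidence-ranking idea can be sketched in a few lines. This is a minimal illustration, not the published method: the notes say only that answer-span confidence ranks traces into synthetic preferences. The function names, the choice of geometric-mean token probability as the confidence score, and the `min_margin` filter are all assumptions made here.

```python
import math
from itertools import combinations


def answer_span_confidence(token_logprobs):
    """Geometric-mean token probability over the answer span.

    Assumption: this is one plausible confidence proxy, not a documented one.
    """
    if not token_logprobs:
        raise ValueError("answer span must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def build_preference_pairs(traces, min_margin=0.05):
    """Turn scored traces into (chosen, rejected) pairs.

    traces: list of (trace_text, answer_token_logprobs) for ONE prompt.
    min_margin: hypothetical threshold; pairs whose confidence gap is
    smaller than this are skipped as too ambiguous to rank.
    """
    scored = [(text, answer_span_confidence(lps)) for text, lps in traces]
    pairs = []
    for (text_a, conf_a), (text_b, conf_b) in combinations(scored, 2):
        if abs(conf_a - conf_b) < min_margin:
            continue
        if conf_a > conf_b:
            pairs.append((text_a, text_b))
        else:
            pairs.append((text_b, text_a))
    return pairs


# Three sampled traces for the same prompt, with made-up log-probabilities.
traces = [
    ("trace_A", [-0.10, -0.10]),
    ("trace_B", [-1.00, -1.00]),
    ("trace_C", [-0.12, -0.10]),
]
pairs = build_preference_pairs(traces)
```

Here `trace_A` and `trace_C` are too close in confidence to rank, so only the pairs that put each of them above `trace_B` are kept. The resulting pairs could then feed any pairwise preference objective.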
But the cleaner framing may be that this isn't really an either/or. Eurus reached state-of-the-art reasoning among open models by training on preference *trees*: a data structure that holds diverse solution chains, critique-and-revision trajectories, and pairwise comparisons all at once, feeding both SFT and preference learning from the same source (see "What alignment data structure best trains reasoning generalists?"). SFT gives the model a competent starting distribution; preference learning then sculpts *which* of its reasoning modes to favor. That maps onto a deeper claim: base models already contain latent reasoning ability, and post-training mostly *selects* rather than *creates* it (see "Do base models already contain hidden reasoning ability?"). If reasoning is being elicited rather than installed, then preference learning's comparative signal is just a more precise selection tool than imitation.
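A toy version of such a tree shows how one structure can feed both training stages. This is a simplified sketch, not the actual dataset schema: the `Node` class, the `correct` flag, and both extraction functions are hypothetical names introduced here, and critique-and-revision branches are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One step in a solution attempt (hypothetical, simplified schema)."""
    text: str
    correct: bool = True
    children: List["Node"] = field(default_factory=list)


def sft_chains(node: Node, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield each root-to-leaf path that contains only correct steps."""
    if not node.correct:
        return
    path = prefix + (node.text,)
    if not node.children:
        yield path
    for child in node.children:
        yield from sft_chains(child, path)


def preference_pairs(
    node: Node, prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str, str]]:
    """Yield (context, chosen, rejected) for siblings that differ in correctness."""
    path = prefix + (node.text,)
    good = [c for c in node.children if c.correct]
    bad = [c for c in node.children if not c.correct]
    for chosen in good:
        for rejected in bad:
            yield (path, chosen.text, rejected.text)
    for child in node.children:
        yield from preference_pairs(child, path)


# One instruction with two competing first steps and two competing second steps.
root = Node("Q: what is 12 * 13?", children=[
    Node("12 * 13 = 12 * 10 + 12 * 3", children=[
        Node("= 120 + 36 = 156"),
        Node("= 120 + 26 = 146", correct=False),
    ]),
    Node("12 * 13 = 12 * 12 + 1", correct=False),
])

chains = list(sft_chains(root))
pairs = list(preference_pairs(root))
```

The same tree yields one fully correct chain for imitation and two step-level comparisons for preference learning, so the two stages share a single source of data.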
The cautionary note for both: neither method reliably installs a *procedure*. RL-tuned models, including those trained with GRPO, still drop sharply on out-of-distribution variants, suggesting they sharpen template-matching and memorization rather than genuine problem-solving (see "Do fine-tuned language models actually learn optimization procedures?"). And preference learning is only as good as its rankings: annotation data secretly mixes genuine preferences, non-attitudes, and constructed-on-the-spot judgments, and treating them as one signal contaminates the reward model (see "Do all annotation responses measure the same underlying thing?"). So the real difference isn't 'preference learning reasons and SFT memorizes'. It's that preference learning gives you a knob for *what to prefer*, and the quality of your reasoning is now bottlenecked on whether you actually know what good reasoning looks like.
Sources (9 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
SAMI finetunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A mistral-7b trained this way outperformed base and instruction-tuned baselines, and surprisingly, a weaker model could write principles to align a stronger one.
Eurus achieved state-of-the-art open-model reasoning by training on ULTRAINTERACT, an alignment dataset structured as preference trees per instruction. The tree format unified diverse planning strategies, interaction-and-critique trajectories, and pairwise data for both SFT and preference learning.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.