What makes some training data teach brittle answers versus robust reasoning?

This explores why some training data produces models that memorize correct-looking answers while other data builds reasoning that holds up on new problems — and what distinguishes the two.

The corpus converges on a surprising answer: brittleness comes less from *wrong* data than from *too-clean* data. When you train on polished shortcut solutions (the final answer, the confident trace, the verified path), the model learns to reproduce the surface of reasoning without the substance. Training on messier material, including failure and recovery, tends to teach the more durable thing.

The sharpest evidence is the gap between what benchmarks measure and what models actually learn. Supervised fine-tuning can raise final-answer accuracy while *degrading* the quality of the reasoning steps that produce it: one study measured a 38.9% drop in 'information gain,' meaning the model increasingly arrives at right answers through post-hoc rationalization rather than genuine inference (see: Does supervised fine-tuning improve reasoning or just answers?). The metric improves; the reasoning rots. A related failure: training on labeled examples of 'good arguments' teaches models surface patterns, not the underlying criteria. Only explicit instruction in theoretical frameworks such as RATIO or QOAM transfers to argument types the model hasn't seen (see: Can models learn argument quality from labeled examples alone?). In both cases the data taught the answer's shape, not its logic.
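
To make the accuracy-versus-reasoning gap concrete, here is a minimal sketch of one way a per-step 'information gain' could be computed. It assumes the metric is the change in log-probability of the correct answer after each reasoning step; the note does not give the study's exact definition, and the function names and the example probabilities are hypothetical.

```python
import math
from typing import List


def step_information_gain(p_correct: List[float]) -> List[float]:
    """Per-step gain in bits toward the correct answer.

    p_correct[0] is the probability of the correct answer before any
    reasoning step; p_correct[i] is the probability after step i.
    ASSUMPTION: these probabilities are supplied by the caller (for
    example by scoring the answer under the model) and are all > 0.
    """
    return [
        math.log2(after) - math.log2(before)
        for before, after in zip(p_correct, p_correct[1:])
    ]


def mean_gain(p_correct: List[float]) -> float:
    """Average information gain per reasoning step."""
    gains = step_information_gain(p_correct)
    return sum(gains) / len(gains)


# Two hypothetical traces that end with the same answer probability.
# In the first, each step makes the answer more likely (inference).
inferential = [0.125, 0.25, 0.5]
# In the second, the answer was fixed before the steps were written,
# so the steps add nothing (post-hoc rationalization).
rationalized = [0.5, 0.5, 0.5]

print(mean_gain(inferential))   # 1.0 bit per step
print(mean_gain(rationalized))  # 0.0 bits per step
```

A final-answer metric scores these two traces identically, which is the blind spot described above: only the per-step view separates them.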

What builds robustness instead? Engagement with failure. Training on complete exploration paths, including dead ends, backtracking, and self-correction, internalizes search rather than memorizing solutions, and produces deeper reasoning than shortcut traces (see: Can models learn better by training on messy exploration paths?). Training models to *critique* noisy responses beats training them to imitate correct ones, because critique forces engagement with how things go wrong (see: Does critiquing errors teach deeper understanding than imitating correct answers?). The recurring theme: data that exposes the model to error structure generalizes; data that hides it doesn't.
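
The difference between a shortcut trace and a full exploration path is easy to see on a toy problem. The sketch below is an illustration only, not the cited work's data pipeline: it runs a depth-first search for a subset of numbers that hits a target and records every attempt, including overshoots and backtracking. The names `search` and `shortcut_trace` and the trace wording are hypothetical, and the pruning assumes positive integers.

```python
from typing import List, Optional, Tuple


def search(numbers: List[int], target: int) -> Tuple[Optional[List[int]], List[str]]:
    """Depth-first search for a subset of `numbers` that sums to `target`.

    Returns (solution, full_trace). The trace logs every attempt,
    including dead ends. ASSUMPTION: all numbers are positive, so a
    running total above the target can be pruned.
    """
    trace: List[str] = []

    def dfs(start: int, chosen: List[int], total: int) -> Optional[List[int]]:
        if total == target:
            trace.append(f"solved: {chosen} sums to {target}")
            return list(chosen)
        for i in range(start, len(numbers)):
            candidate = chosen + [numbers[i]]
            new_total = total + numbers[i]
            if new_total > target:
                trace.append(f"try {candidate}: overshoots, skip")
                continue
            trace.append(f"try {candidate}: total {new_total}")
            found = dfs(i + 1, candidate, new_total)
            if found is not None:
                return found
            trace.append(f"dead end at {candidate}, backtrack")
        return None

    return dfs(0, [], 0), trace


def shortcut_trace(solution: List[int], target: int) -> List[str]:
    """The polished version: only the verified path, no search."""
    return [f"{' + '.join(map(str, solution))} = {target}"]


solution, exploration = search([5, 3, 4, 2], 6)
print(shortcut_trace(solution, 6))  # one clean line
print("\n".join(exploration))       # every attempt, including failures
```

Both traces end at the same answer, but only the exploration trace contains the error structure (what was tried, why it failed, when to back up) that the paragraph above argues is what generalizes.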

There's a second, subtler axis: what the data does to a model's *uncertainty*. Richer teacher context (conditioning on the correct answer and verifier output) produces confident, concise student traces that ace in-domain tests but collapse out-of-distribution, because the confident style suppresses the epistemic caution that hard new problems require (see: Does richer teacher context hurt student generalization?). Post-training objectives reliably push toward correctness while silently degrading 'unmeasured' behaviors like expressing doubt; single-objective optimization leaves the stylistic features critical to generalization unprotected (see: Can post-training objectives preserve reasoning style alongside correctness?). You can even read brittleness off the confidence curve: models that commit early and rationalize show measurably flawed reasoning, and rewarding *gradual* confidence growth via RL improves accuracy substantially (by 42 percentage points on Countdown) without any process labels (see: Can confidence trajectories reveal when reasoning goes wrong?). Confident-but-brittle and uncertain-but-robust turn out to be trainable opposites, and confidence itself predicts whether a model survives prompt rephrasing: high-confidence outputs resist rewording, while low-confidence ones swing widely (see: Does model confidence predict robustness to prompt changes?).
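
As a rough illustration of what "rewarding gradual confidence growth" could look like, here is a toy shaping reward. The note does not specify the actual reward used in the cited work, so the formula below (final confidence minus the largest single-step jump, for correct answers only) and the name `gradual_confidence_reward` are assumptions made for this sketch.

```python
from typing import List


def gradual_confidence_reward(confidence: List[float], correct: bool) -> float:
    """Toy reward in [0, 1] that favors confidence built up step by step.

    `confidence[i]` is the model's confidence in its eventual answer
    after reasoning step i, in [0, 1]. ASSUMPTION: confidence starts
    at 0.0 before the first step, so committing immediately counts as
    one large jump.
    """
    if not correct or not confidence:
        return 0.0
    previous = 0.0
    largest_jump = 0.0
    for c in confidence:
        largest_jump = max(largest_jump, c - previous)
        previous = c
    # High final confidence is good; reaching it in one leap is penalized.
    return max(0.0, confidence[-1] - largest_jump)


gradual = [0.25, 0.5, 0.75, 1.0]      # evidence accumulates step by step
early_commit = [1.0, 1.0, 1.0, 1.0]   # decides first, rationalizes after

print(gradual_confidence_reward(gradual, correct=True))       # 0.75
print(gradual_confidence_reward(early_commit, correct=True))  # 0.0
```

The reward reads only the confidence trajectory and final correctness, which mirrors the point above that no process labels are needed. A real objective would need more care, for example in how confidence is estimated at each step.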

The most disorienting finding complicates the whole picture: models trained on *deliberately corrupted* reasoning traces (systematically irrelevant to the problem) perform comparably to those trained on correct ones, and sometimes generalize better, suggesting traces partly function as computational scaffolding, not meaningful logic (see: Do reasoning traces need to be semantically correct?). Read alongside the finding that base models already contain latent reasoning that minimal training merely *elicits* rather than creates (see: Do base models already contain hidden reasoning ability?), a reframing emerges: maybe robust-vs-brittle isn't about teaching reasoning at all, but about whether your data *selects for* capability already present versus *overwrites* it with a confident, shortcut-shaped veneer. The brittle answer isn't the model failing to learn; it's the model learning the wrong thing too well.

Sources (10 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Can models learn better by training on messy exploration paths?

Research shows that training on messy trajectories—failed attempts, self-correction, and backtracking—teaches more robust reasoning than training only on shortcut solutions. This approach models o1-style deep reasoning as search internalization rather than solution memorization.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Can post-training objectives preserve reasoning style alongside correctness?

Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.

Can confidence trajectories reveal when reasoning goes wrong?

Models that commit to answers early and then rationalize show measurably flawed reasoning. Rewarding gradual confidence growth via RL improves accuracy significantly (by 42 percentage points on Countdown) without needing process labels or external reward models.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
