Why do single examples trigger large reasoning improvements in models?

This explores why a single training example (or a tiny fraction of training signal) can unlock large reasoning gains — and what that says about where reasoning ability actually lives in a model.

The corpus points to a striking answer: the capability is already latent in the model, so training mostly *activates* it rather than *teaches* it. The clearest evidence is direct: a single example in RLVR lifts math accuracy from 36% to 73.6%, and test accuracy keeps climbing for 1,400 steps after training accuracy reaches 100% (Can a single training example unlock mathematical reasoning?). That post-saturation generalization is the tell: if the model were learning new skills from the example, performance would plateau once it stopped getting the training answer wrong. Instead it keeps improving, which suggests the example is flipping a switch on something the model already knew how to do.

A second thread explains *where* that latent capability comes from. Reasoning ability seems to be built during pretraining from broad, transferable procedural knowledge drawn from many documents, unlike factual recall, which depends on narrowly memorizing specific source documents (Does procedural knowledge drive reasoning more than factual retrieval?). If the procedure is already distributed through the weights, a tiny nudge is enough to route the model into using it. That reframes the single example not as a lesson but as a key.

The most counterintuitive corner: the content of the training signal may barely matter. Models trained on deliberately corrupted or systematically irrelevant reasoning traces perform comparably to those trained on correct ones, and sometimes generalize *better* out of distribution (Do reasoning traces need to be semantically correct?). Relatedly, reasoning traces behave more like stylistic mimicry than verified computation: invalid logical steps score nearly as well as valid ones (Do reasoning traces show how models actually think?). So if traces are computational scaffolding rather than carriers of meaning, a single example mainly teaches the *shape* of reasoning, not its substance, and shape is cheap to convey.

There's also a mechanistic view of *which part* of training carries the signal. Only about 20% of tokens are high-entropy 'forking points' where the model decides where reasoning goes next, and RLVR primarily adjusts those; training on that minority alone matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Reasoning improvement is concentrated, not diffuse, which is exactly why a small intervention can move so much. The same concentration logic appears at decoding time: just penalizing premature thought-switching improves accuracy with no fine-tuning at all, because viable solutions are being abandoned rather than never found (Do reasoning models switch between ideas too frequently?; Why do reasoning models abandon promising solution paths?).
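As a rough illustration of the two mechanisms in the paragraph above, the Python sketch below marks the highest-entropy positions in a sequence as 'forking' tokens and applies a flat logit penalty to thought-switching tokens at decoding time. Everything named here is hypothetical: `token_entropy`, `forking_mask`, `penalize_switch_tokens`, the top-k selection rule, the penalty value, and the toy distributions are assumptions made for illustration, not the procedures used in the cited work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(distributions, fraction=0.2):
    """Flag the top `fraction` of positions by entropy as 'forking' tokens.

    `distributions` holds one probability list per generated position.
    The 0.2 default mirrors the ~20% figure in the text; selecting by
    top-k entropy is an assumption of this sketch.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    # Indices of the k highest-entropy positions.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]


def penalize_switch_tokens(logits, switch_token_ids, penalty=3.0):
    """Subtract a fixed penalty from the logits of thought-switching tokens.

    `switch_token_ids` is a hypothetical set of vocabulary ids for tokens
    that open a new line of thought (for example, "Alternatively").
    """
    return [x - penalty if i in switch_token_ids else x for i, x in enumerate(logits)]


# Toy next-token distributions for five positions (made-up values).
dists = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]
mask = forking_mask(dists)
print(mask)  # → [False, True, False, False, False]
```

In an actual RLVR setup the mask would presumably gate which token positions contribute to the policy-gradient update, and the penalty would be applied inside the sampling loop; both are left out here to keep the sketch self-contained.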

One caution: 'activation' is not the same as deepening reasoning. Fine-tuning can raise benchmark scores while weakening the causal link between reasoning steps and answers: Information Gain drops by 38.9% as models shift to post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?; Does fine-tuning disconnect reasoning steps from final answers?). And because models fit instance-level patterns rather than general algorithms, a single example helps most when test problems resemble it; failures track unfamiliarity, not difficulty (Do language models fail at reasoning due to complexity or novelty?). So the surprising lesson is that 'one example unlocks reasoning' and 'the model is mostly performing reasoning it already had' are the same finding seen from two angles.

Sources (10 notes)

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
