Why did prior multi-token prediction methods fail during fine-tuning?

This explores why earlier attempts to make models predict several tokens at once broke down specifically in the fine-tuning (post-training) stage, and what changed to make it work.

Multi-token prediction has a model commit to several future tokens at once rather than one at a time, and it kept failing when applied during fine-tuning rather than pretraining. The corpus points to one paper that tackles this head-on: CAFT (see "Can models learn multi-token concepts during fine-tuning?"). Its framing of the failure is "next-token fragmentation": a coherent idea like a protein motif or a multi-word concept gets shredded into per-token pieces during ordinary fine-tuning, and earlier multi-token schemes couldn't reassemble those pieces into a stable target the model could actually learn toward. CAFT's fix is to grow auxiliary prediction heads by self-distillation, so the multi-token objective is bootstrapped from the model's own knowledge rather than imposed cold. Notably, even a lightweight LoRA version beats full standard fine-tuning, which suggests models learn more effectively in the multi-token setting.
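
To make the contrast with ordinary next-token training concrete, here is a minimal sketch of a generic multi-token objective with auxiliary heads. It is an illustration under stated assumptions, not CAFT's implementation: the function names, the `aux_weight` parameter, and the convention that head k predicts the token k + 1 steps ahead are all hypothetical, and the self-distillation step that CAFT uses to train its heads is not reproduced here.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one logit vector.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def multi_token_loss(head_logits, tokens, aux_weight=0.5):
    """Sum of per-head mean cross-entropies (hypothetical layout).

    head_logits[k][t] is the logit vector produced by head k at position t.
    Head k is trained to predict tokens[t + k + 1], so head 0 is the
    ordinary next-token head and heads 1, 2, ... look further ahead.
    """
    total = 0.0
    for k, per_position in enumerate(head_logits):
        weight = 1.0 if k == 0 else aux_weight  # assumed weighting scheme
        losses = []
        for t, logits in enumerate(per_position):
            target_index = t + k + 1
            if target_index >= len(tokens):
                break  # no target this far ahead near the end of the sequence
            losses.append(-log_softmax(logits)[tokens[target_index]])
        if losses:
            total += weight * sum(losses) / len(losses)
    return total
```

With a single head this reduces to plain next-token cross-entropy, which is the setting where a multi-token concept is only ever seen one piece at a time. Each extra head adds a target that spans more of the concept.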

Why would fine-tuning be the fragile moment specifically? A neighboring result gives a clue: direct weight fine-tuning corrupts knowledge stored in a model's lower layers, which is why decoding-time proxy-tuning preserves more of what the base model knew (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). Multi-token objectives ask more of those same fragile layers at once, so a method that is merely workable in pretraining can tip into being destructive during post-training. CAFT's self-distillation route sidesteps this by leaning on knowledge the model already holds instead of overwriting it.
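
The reason proxy-tuning can leave the base model's weights untouched is easier to see in code. The sketch below follows the logit arithmetic as proxy-tuning is commonly described: the large base model's logits are shifted by the difference between a small tuned model and its untuned counterpart. The function and argument names are illustrative assumptions, and the logit lists stand in for real model outputs over a shared vocabulary.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Next-token distribution under decoding-time tuning (illustrative).

    The base model is only queried, never updated. The shift that tuning
    would have applied is estimated from a small tuned/untuned model pair.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)
```

When the small tuned and untuned models agree on a token, the shift is zero and the base model's own preference passes through unchanged, which is one way to read the claim that the base model's stored knowledge survives.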

There's a deeper reason the naive version was doomed, visible in work on how reasoning tokens are weighted. Not all tokens carry equal learning signal: only about 20% of tokens are high-entropy "forking points" that actually steer the model, and training on just those matches or exceeds full updates (see "Do high-entropy tokens drive reasoning model improvements?"). Relatedly, models internally rank tokens by function, preferentially preserving symbolic-computation tokens while pruning grammar and filler (see "Which tokens in reasoning chains actually matter most?"). A multi-token method that treats every position as equally important is fighting the model's own internal economy. It spends its budget predicting low-stakes tokens, which is exactly the kind of fragmentation CAFT names.
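
The idea of training only on high-entropy "forking" tokens can be sketched as a masking step. This is a simplified illustration, not the cited work's procedure: the 20% figure comes from the note above, while the function names, the top-fraction selection rule, and the use of precomputed per-token probabilities and losses are assumptions made for the example.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of one predicted token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(token_probs, keep_fraction=0.2):
    """Mark the positions whose predictive entropy is in the top fraction."""
    entropies = [entropy(p) for p in token_probs]
    keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:keep])
    return [i in chosen for i in range(len(entropies))]

def masked_loss(per_token_losses, mask):
    # Average the loss over the selected positions only.
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept)
```

A multi-token objective with uniform weights is the opposite of this mask: every position, including near-deterministic grammar tokens, contributes equally to the update.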

The doorway worth walking through here: the lesson isn't "multi-token prediction is hard" but "fine-tuning is a structurally riskier place to change a model than it looks." The same lower-layer fragility that breaks multi-token objectives also shows up as RL collapsing format diversity onto a single pretrained pattern (see "Does RL training collapse format diversity in pretrained models?"). Prior multi-token methods failed during fine-tuning for the same family of reasons many post-training interventions misfire: they overwrite what the model already encodes instead of building on it.

Sources (5 notes)

Can models learn multi-token concepts during fine-tuning?

CAFT successfully brings multi-token prediction to post-training via self-distilled auxiliary heads, outperforming next-token fine-tuning on tasks like protein design. CAFT LoRA even outperforms full next-token fine-tuning, suggesting models learn more effectively in multi-token settings.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
