SYNTHESIS NOTE

Can models learn multi-token concepts during fine-tuning?

Does training models to predict multiple tokens at once, rather than one token at a time, help them form coherent semantic units? This matters because standard next-token prediction fragments concepts like "ribonucleic acid" into arbitrary subword pieces.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

Next-token prediction fragments multi-token concepts into arbitrary subword units. "Ribonucleic acid" becomes "rib" → "on" → "ucle" → "ic" → "acid" — five separate prediction targets with no unified semantic representation. Concept-Aware Fine-Tuning (CAFT) introduces multi-token prediction into post-training, enabling models to learn sequences that span multiple tokens as coherent concepts.
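
To make the difference in prediction targets concrete, here is a minimal sketch that builds the targets each position would see under next-token versus multi-token prediction. The subword split is the illustrative one quoted above, not the output of any particular tokenizer, and the helper name `multi_token_targets` and its `n_future` parameter are hypothetical.

```python
from typing import List, Optional


def multi_token_targets(tokens: List[str], n_future: int) -> List[List[Optional[str]]]:
    """For each position t, list the next `n_future` tokens.

    Slots that would fall past the end of the sequence are None,
    meaning there is no target for that look-ahead distance.
    """
    rows: List[List[Optional[str]]] = []
    for t in range(len(tokens)):
        row: List[Optional[str]] = []
        for k in range(1, n_future + 1):
            idx = t + k
            row.append(tokens[idx] if idx < len(tokens) else None)
        rows.append(row)
    return rows


# Illustrative split taken from the note (assumed, not a real tokenizer's output).
pieces = ["rib", "on", "ucle", "ic", "acid"]

next_token = multi_token_targets(pieces, n_future=1)  # one target per position
four_ahead = multi_token_targets(pieces, n_future=4)  # four targets per position

for piece, single, multi in zip(pieces, next_token, four_ahead):
    print(f"{piece!r}: next-token target={single}, multi-token targets={multi}")
```

With `n_future=1`, the position holding "rib" is trained only against "on". With `n_future=4`, the same position is trained against the whole remainder of the concept, which is the sense in which the multi-token setting treats the span as one unit.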

Prior multi-token prediction methods were applied only during pretraining, which is prohibitively expensive and dominated by general language modeling rather than domain-specific concept formation. Earlier attempts to apply multi-token prediction during fine-tuning failed because it represents a dramatic distribution shift that short post-training phases cannot absorb. CAFT addresses this with self-distilled auxiliary heads, in two stages: first, train auxiliary heads (which predict positions beyond the next token) on an instruction-tuning mixture with self-distilled ground truth; then fine-tune with a multi-token loss on top of standard LoRA or full fine-tuning.
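
The multi-token loss used in the fine-tuning stage can be pictured as a weighted sum of per-head cross-entropies. The sketch below is an assumption-laden illustration, not CAFT's actual objective: the note does not specify the weighting scheme or head architecture, so `head_weights`, the list-of-logits interface, and both function names are hypothetical. The self-distillation stage that trains the auxiliary heads is not shown.

```python
import math
from typing import List, Optional, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multi_token_loss(
    head_logits: List[Sequence[float]],
    targets: List[Optional[int]],
    head_weights: List[float],
) -> float:
    """Weighted sum of per-head cross-entropies at one sequence position.

    head_logits[k] are the logits from head k, where head 0 is the ordinary
    next-token head and heads 1.. are auxiliary heads (assumed layout).
    targets[k] is the token id k+1 steps ahead, or None past the sequence end.
    """
    total = 0.0
    for logits, target, weight in zip(head_logits, targets, head_weights):
        if target is None:
            continue  # no ground truth this far ahead; skip the head
        total += weight * cross_entropy(logits, target)
    return total


# Toy example: vocabulary of 4 tokens, two heads with uniform logits.
uniform = [0.0, 0.0, 0.0, 0.0]
loss = multi_token_loss([uniform, uniform], targets=[1, 2], head_weights=[1.0, 0.5])
print(loss)
```

Setting every auxiliary weight to zero recovers the standard next-token loss, so in this sketch the auxiliary heads act purely as an additional training signal on top of ordinary fine-tuning.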

The results: CAFT consistently outperforms next-token fine-tuning across text summarization and de novo protein design. CAFT LoRA often outperforms next-token full fine-tuning — suggesting models learn more effectively in a multi-token setting even with fewer trainable parameters. In settings where multi-token prediction is highly advantageous (protein design, where amino acid sequences have multi-residue semantic units), multi-fold performance increases are observed.

This connects to the format-shapes-reasoning finding (see the note "Does training data format shape reasoning strategy more than domain?"): the prediction unit (single token vs. multi-token) is a format variable that shapes what the model learns. Multi-token prediction is a higher-level format that encourages conceptual chunking rather than token-by-token prediction.

The democratization aspect matters: pretraining-phase MTP was restricted to well-resourced labs. CAFT brings this to fine-tuning, where any practitioner can apply it. Trained task-agnostic auxiliary heads are provided for popular open-source models.

Original note title

multi-token concept-aware fine-tuning overcomes next-token fragmentation to form coherent semantic entities during post-training