Why does full multi-task fine-tuning perform worse than sequential training?

This note explores why full multi-task fine-tuning (updating one model on every task at once) often loses to sequential training, where tasks arrive in order. The corpus points to a single underlying cause, task interference over shared parameters, rather than ordering being magic on its own: tasks fight over the same parameters, and joint training forces a compromise that serves none of them well. When you isolate what each task actually needs, the gap mostly disappears.

The sharpest evidence is that ordering alone isn't the fix; structure is. One line of work shows that identifying the 'core' parameter regions each task depends on, freezing those, and geometrically merging the rest beats standard multi-task fine-tuning. Crucially, it finds that temporal scheduling by itself is insufficient without explicit structural parameter isolation (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). In other words, sequential training often wins not because order is inherently better, but because doing tasks one at a time accidentally reduces the head-on collision that joint training creates.
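The isolate-freeze-merge idea above can be sketched on flat parameter lists. This is a minimal illustration under stated assumptions, not the cited method: the notes do not say how core regions are found or which geometric merge is used, so the sketch assumes "core" means the parameters that moved most during a task's fine-tuning and uses spherical interpolation (SLERP) for the rest. The names `core_mask`, `slerp`, `isolate_and_merge`, and `keep_fraction` are hypothetical.

```python
import math

def core_mask(base, tuned, keep_fraction=0.25):
    """Flag the parameters that moved most during fine-tuning as 'core'.
    Assumption: update magnitude is a stand-in for task importance."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    k = max(1, int(len(deltas) * keep_fraction))
    order = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    top = set(order[:k])
    return [i in top for i in range(len(deltas))]

def slerp(a, b, t=0.5):
    """Spherical interpolation between two vectors.
    Falls back to linear interpolation when the angle is degenerate."""
    lerp = [(1 - t) * x + t * y for x, y in zip(a, b)]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return lerp
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos)))
    s = math.sin(omega)
    if s < 1e-6:  # vectors (anti)parallel: SLERP weights are undefined
        return lerp
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def isolate_and_merge(base, tuned_a, tuned_b, keep_fraction=0.25):
    """Keep each task's core values untouched, merge everything else."""
    mask_a = core_mask(base, tuned_a, keep_fraction)
    mask_b = core_mask(base, tuned_b, keep_fraction)
    n = len(base)
    free = [i for i in range(n) if not (mask_a[i] or mask_b[i])]
    merged_free = slerp([tuned_a[i] for i in free], [tuned_b[i] for i in free])
    merged = list(base)
    for i in range(n):
        if mask_a[i] and mask_b[i]:
            # Contested region: plain average as a placeholder (assumption).
            merged[i] = 0.5 * (tuned_a[i] + tuned_b[i])
        elif mask_a[i]:
            merged[i] = tuned_a[i]  # task A's core is kept as-is
        elif mask_b[i]:
            merged[i] = tuned_b[i]  # task B's core is kept as-is
    for j, i in enumerate(free):
        merged[i] = merged_free[j]
    return merged
```

With two tasks whose largest updates land on different indices, each task's core value survives the merge unchanged, which is the point of the structural isolation. Contested regions are where a real method would need something smarter than the average used here.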

But order does carry real mechanical weight, and the reason is surprising: entropy. Training structured tasks (math, code) drives a model's output entropy down, while open-ended creative tasks push it up, so the *sequence* determines whether one task's entropy collapse damages another's capabilities. Training structured tasks first, guided by backward-transfer measurements, yielded a 6.2% gain over joint training precisely by preventing that collapse from spilling into open-ended skills (see "Does training order reshape how models handle different task types?"). Joint training blends these opposing dynamics into a single averaged update, which the corpus identifies as the reason it underperforms. A related finding shows the same direction-dependence in preference tuning: the same procedure reduces diversity in code but increases it in creative writing, because the domains reward opposite things (see "Does preference tuning always reduce diversity the same way?"). A one-size-fits-all joint objective is therefore pulling in contradictory directions.
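The two quantities this paragraph leans on, output entropy and backward transfer, are both simple to compute. The sketch below uses the standard Shannon entropy of a next-token distribution and the usual continual-learning definition of backward transfer (the average change in earlier tasks' scores by the end of training). Whether the cited work uses exactly these forms is an assumption, and the function names and score-matrix layout are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def backward_transfer(acc):
    """acc[i][j] is the score on task j after training through task i,
    with tasks indexed in training order.
    Returns the mean change in each earlier task's score between the
    moment it was trained and the end of the full sequence.
    Negative values mean later tasks damaged earlier ones."""
    last = len(acc) - 1
    if last < 1:
        raise ValueError("need at least two tasks to measure transfer")
    return sum(acc[last][j] - acc[j][j] for j in range(last)) / last

# A peaked distribution (typical after structured-task training) has
# lower entropy than a flat one (typical of open-ended generation).
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
entropy_gap = entropy(flat) - entropy(peaked)
```

A scheduler in this spirit would measure backward transfer for candidate orderings on a small budget and prefer the ordering with the least negative value. The notes report that this favored training structured tasks first.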

The more provocative takeaway, though, is that the whole framing may be a false choice. Several notes reframe forgetting and interference as a *misallocation* problem rather than an unavoidable cost. Splitting adaptation into slow weight updates and fast textual context reaches equivalent performance with far less catastrophic forgetting, which is evidence that the interference was never inherent, just badly routed (see "Can splitting adaptation into two channels reduce forgetting?"). Architectures that compose task-specific expert vectors at inference time mix skills *without* interference (see "Can models dynamically activate expert skills at inference time?"), and freezing the backbone while delegating new work to a small auxiliary module preserves prior capability outright (see "Can continuous reasoning avoid forgetting in instruction-tuned models?").
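To make the expert-vector idea concrete, here is a small sketch of composing skills by rescaling singular values. It assumes a weight matrix is already available in factored form W = U diag(s) Vᵀ, that each expert is a vector z applied elementwise to s, and that experts are combined by a weighted sum at inference. The notes say only that singular values are tuned and the resulting vectors mix, so the elementwise scaling and linear mixing are assumptions, and all names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def rebuild(u, s, vt, z):
    """Recompose U diag(s * z) V^T. U and V^T are never modified;
    only the singular values are rescaled by the expert vector z."""
    scaled_u = [[u[i][k] * s[k] * z[k] for k in range(len(s))] for i in range(len(u))]
    return matmul(scaled_u, vt)

def mix_experts(experts, weights):
    """Weighted sum of expert vectors, chosen per input at inference."""
    size = len(experts[0])
    return [sum(w * e[k] for w, e in zip(weights, experts)) for k in range(size)]

# Toy factors: identity U and V^T keep the arithmetic easy to follow.
u = [[1.0, 0.0], [0.0, 1.0]]
vt = [[1.0, 0.0], [0.0, 1.0]]
s = [2.0, 1.0]

z_math = [1.5, 1.0]   # hypothetical expert vector for one skill
z_code = [1.0, 0.5]   # hypothetical expert vector for another skill

z = mix_experts([z_math, z_code], [0.5, 0.5])
w_adapted = rebuild(u, s, vt, z)
```

Because each expert only rescales a shared set of directions and the base factors stay fixed, adding a new expert never overwrites an existing one. That is the sense in which this kind of composition sidesteps the parameter contention that joint fine-tuning creates.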

Worth knowing: the multi-task-loses story isn't universal. When a single capability is decomposed into genuinely *complementary* subtasks (the seven facets of function calling, say), explicit multi-task training generalizes better than lumpy umbrella datasets (see "Can breaking function calling into subtasks improve model generalization?"). The deciding factor is whether your tasks reinforce or contend. Multi-task fine-tuning loses when tasks compete for the same parameters and pull entropy in opposite directions; it wins when they're facets of one skill. Sequential training is just the cheapest way to dodge contention, not the only one.

Sources (7 notes)

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Does training order reshape how models handle different task types?

Omni-Thinker shows that structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating why full multi-task fine-tuning underperforms sequential training in LLMs. The question remains open; treat the findings below as dated claims (2023–2026) to be re-tested against current capability and method.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library identified:
- Task interference stems from parameter competition and opposing entropy dynamics (structured tasks collapse entropy; creative tasks expand it), not inherently from joint training (2025–2026).
- Sequential training wins largely by accident—isolating core task-dependent parameter regions and freezing them outperforms standard multi-task fine-tuning; ordering alone is insufficient without structural isolation (2025).
- Entropy-aware scheduling yields 6.2% gain over joint training by preventing structured-task entropy collapse from degrading open-ended skills (2025).
- Alternative routing—slow weight adaptation + fast textual context, or task-specific expert composition at inference—avoids catastrophic forgetting entirely, suggesting interference is a routing problem, not inevitable (2025–2026).
- Multi-task *wins* when tasks are genuinely complementary facets (e.g., seven function-calling subtasks), not when they compete (2024).

Anchor papers (verify; mind their dates):
- arXiv:2508.21741 (2025-08): Smart parameter isolation for fine-tuning.
- arXiv:2501.06252 (2025-01): Self-adaptive LLMs with expert composition.
- arXiv:2605.12484 (2026-05): Continual adaptation (fast and slow).
- arXiv:2407.00121 (2024-06): Multi-task function-calling decomposition.

Your task:
(1) RE-TEST EACH CONSTRAINT. For entropy-driven interference: have newer optimizers, scheduler designs, or loss landscapes (e.g., SAM, DPO variants, mixture-of-experts at scale) since relaxed the opposing-entropy problem? For parameter isolation: do recent pruning methods or LoRA variants now auto-detect task-critical regions cheaply? Judge whether the core question—task contention is *solvable* by routing, not inevitable—still holds or has been overturned.
(2) Surface the strongest work from the last ~4 months that *contradicts* the "multi-task is routing" frame (e.g., evidence that joint training has intrinsic sample-efficiency or generalization benefits that isolation sacrifices).
(3) Propose 2 research questions that assume the regime may have moved: (a) Can a single model trained on mixed-entropy tasks via online entropy-balancing (not scheduling) match sequential performance? (b) Does auxiliary-head composition scale to >100 tasks, or does expert interference re-emerge at scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
