What task structures benefit most from geometric parameter merging?

This explores when averaging or geometrically combining model weights from separately-trained tasks actually helps — and what shape a task has to have for that merge not to blur things together.

The clearest answer in the corpus comes from work on parameter isolation: the merge works best when you *don't* merge the parts that matter most. Research on multi-task interference shows that each task has a small set of 'core' parameter regions doing its essential work, and the winning recipe is to freeze those core regions per task while geometrically merging only the non-core parameters, meaning the shared, less task-defining weights (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). So the task structures that benefit are ones where you can cleanly separate a task-specific core from a mergeable common substrate, and where overlapping tasks can be clustered together. When tasks genuinely share structure, the geometry of their non-core weights lines up and averaging them helps; when they don't, merging the cores damages both tasks.
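The freeze-the-core, merge-the-rest recipe can be sketched in a few lines. This is only an illustration under stated assumptions, not the method from the cited note: it assumes 'core' means the parameters with the largest fine-tuning update, it assumes 'geometric merging' means spherical linear interpolation (SLERP), and it resolves overlapping cores with a simple tie-break where the note describes clustering tasks. The names `core_mask`, `slerp`, and `merge_with_isolation` are hypothetical.

```python
import math

def core_mask(base, tuned, top_fraction=0.25):
    """Flag the parameters with the largest update (assumed 'core' criterion)."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    k = max(1, int(len(deltas) * top_fraction))
    threshold = sorted(deltas, reverse=True)[k - 1]
    return [d >= threshold and d > 0 for d in deltas]

def slerp(u, v, t=0.5):
    """Spherical linear interpolation; falls back to linear when ill-defined."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    linear = [(1 - t) * a + t * b for a, b in zip(u, v)]
    if norm_u == 0 or norm_v == 0:
        return linear
    cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    omega = math.acos(max(-1.0, min(1.0, cos)))
    s = math.sin(omega)
    if s < 1e-8:  # vectors are (anti)parallel, so the arc is degenerate
        return linear
    w_u = math.sin((1 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]

def merge_with_isolation(base, tuned_a, tuned_b, top_fraction=0.25):
    """Keep each task's core values untouched; SLERP only the non-core rest."""
    mask_a = core_mask(base, tuned_a, top_fraction)
    mask_b = core_mask(base, tuned_b, top_fraction)
    shared = [i for i in range(len(base)) if not mask_a[i] and not mask_b[i]]
    blended = slerp([tuned_a[i] for i in shared], [tuned_b[i] for i in shared])
    merged = list(base)
    for i, value in zip(shared, blended):
        merged[i] = value
    for i in range(len(base)):
        if mask_a[i] and mask_b[i]:
            # Overlapping cores: the note handles this by clustering tasks.
            # This sketch simply keeps whichever task moved the weight more.
            a_moved = abs(tuned_a[i] - base[i])
            b_moved = abs(tuned_b[i] - base[i])
            merged[i] = tuned_a[i] if a_moved >= b_moved else tuned_b[i]
        elif mask_a[i]:
            merged[i] = tuned_a[i]
        elif mask_b[i]:
            merged[i] = tuned_b[i]
    return merged

# Toy weights: task A's core is index 0, task B's core is index 1.
base = [0.0, 0.0, 0.0, 0.0]
tuned_a = [2.0, 0.1, 0.2, 0.1]
tuned_b = [0.1, 3.0, 0.1, 0.2]
merged = merge_with_isolation(base, tuned_a, tuned_b)
```

In this toy case the two core weights pass through unchanged and only the last two entries are blended. A real model would apply the same idea per layer or per parameter region rather than to a flat list.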

That reframes the real question: it's less 'which tasks merge well' and more 'which tasks decompose cleanly into separable pieces.' The corpus has a recurring theme that modular, decomposable tasks are the friendly ones. Separating a 'decomposer' from a 'solver' in multi-step reasoning improves accuracy precisely because planning and execution stop interfering with each other, and notably the decomposition skill transfers across domains while solving does not (see "Does separating planning from execution improve reasoning accuracy?"). Function calling tells the same story from the other side: it breaks into seven distinct sub-skills (nested calls, chaining, parallel functions, parameter detection, and so on), and training them as explicit separate tasks generalizes better than one umbrella dataset (see "Can breaking function calling into subtasks improve model generalization?"). Tasks with this kind of clean internal seam are exactly the ones where you'd expect parameter-space combination to behave, because the boundaries between sub-skills are real rather than tangled.

The flip side is what predicts merge *failure*, and here the corpus is sharp: tasks with opposing internal dynamics fight each other. Structured domains (math, code) push output entropy down, while creative open-ended domains push it up, and naively combining them lets entropy collapse from the structured side damage the open-ended capability (see "Does training order reshape how models handle different task types?"). When two tasks want to move the same statistics in opposite directions, blending their weights blends a contradiction. This is the lateral lesson: geometric merging assumes the tasks' weight changes are roughly compatible vectors, and tasks that demand opposite behaviors violate that assumption.
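One way to make 'roughly compatible vectors' concrete is to compare the directions of two tasks' weight updates. The sketch below is an assumed diagnostic, not something the cited notes prescribe: it treats each task's update (fine-tuned weights minus base weights) as a vector and measures the cosine between them. The toy weights and the names `task_vector` and `cosine` are hypothetical.

```python
import math

def task_vector(base, tuned):
    """Weight change introduced by fine-tuning on one task."""
    return [t - b for b, t in zip(base, tuned)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy example: the two tasks push the same weights in opposite directions.
base = [0.5, -0.2, 0.1]
structured_tuned = [0.9, -0.2, 0.3]   # stand-in for a math/code fine-tune
creative_tuned = [0.1, -0.1, -0.1]    # stand-in for an open-ended fine-tune

v_structured = task_vector(base, structured_tuned)
v_creative = task_vector(base, creative_tuned)
similarity = cosine(v_structured, v_creative)

# Averaging opposed updates mostly cancels them out.
averaged_update = [(a + b) / 2 for a, b in zip(v_structured, v_creative)]
```

Here the similarity is strongly negative and the averaged update is much smaller than either task's own update, which is the 'blended contradiction' in miniature. A clearly positive similarity would suggest the updates are at least pointing the same way, though it does not by itself guarantee a good merge.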

There's also a quieter warning worth knowing. Two models can post identical accuracy yet have completely different internal organization: one cleanly structured, one 'fractured' despite being linearly decodable (see "Can models be smart without organized internal structure?"). For merging, that matters because geometry in weight space only works if the geometry is actually meaningful; a task that scores well but has disorganized internal representations is a poor merge candidate even though its metrics look fine. Standard evaluation won't tell you which case you're in.

So the short version a curious reader can carry away: geometric parameter merging rewards tasks that split into a stable task-specific core plus a shared mergeable remainder, that have genuine modular seams (decompose/solve, sub-skills of a tool-use pipeline), and that don't pull the model's internal statistics in opposing directions. The interesting twist is that the best results come from merging *less* — isolating and protecting the parameters that define each task, and only combining the common ground.

Sources (5 notes)

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
