Can fine-tuning sparse subnetworks alone match full-parameter fine-tuning results?
This explores whether fine-tuning a sparse subnetwork, a deliberately chosen subset of a model's weights, can match the results of updating every parameter. The corpus doesn't contain a clean head-to-head test of that exact claim (no lottery-ticket-style experiment), but several notes approach it from different angles and together make a strong circumstantial case along three lines: that meaningful subnetworks exist, that touching only the right ones helps, and that freezing the rest often beats updating everything.
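To make the question concrete, sparse-subnetwork fine-tuning can be sketched as ordinary gradient descent with a binary mask that zeroes the update for every frozen weight. The sketch below is a toy under loud assumptions: the model is a four-weight linear regressor, the names `finetune`, `mask`, `pretrained`, and `target` are hypothetical, and the new task is constructed to differ from the pretrained weights in a single coordinate, so the example shows the mechanism rather than evidence for the claim.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((dot(w, x) - y) ** 2 for x, y in data) / len(data)

def finetune(weights, mask, data, lr=0.5, steps=200):
    """Gradient descent on a linear model; mask[i] = 0 freezes weight i."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = dot(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(data)
        # Masked update: frozen weights receive a zero step.
        w = [wi - lr * gi * mi for wi, gi, mi in zip(w, grad, mask)]
    return w

random.seed(0)
pretrained = [1.0, -2.0, 0.5, 3.0]
# Assumption: the new task only requires changing the weight at index 2.
target = [1.0, -2.0, 1.5, 3.0]
data = []
for _ in range(64):
    x = [random.uniform(-1.0, 1.0) for _ in range(4)]
    data.append((x, dot(target, x)))

sparse_w = finetune(pretrained, [0, 0, 1, 0], data)  # update one weight
full_w = finetune(pretrained, [1, 1, 1, 1], data)    # update every weight

print("sparse loss:", mse(sparse_w, data))
print("full loss:  ", mse(full_w, data))
```

In this rigged setting both runs drive the loss to roughly zero, because the mask happens to cover the one weight the task needs. Choosing the mask is the hard part in practice, and the toy says nothing about how to find it.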
Start with whether the subnetworks are even there to find. Pruning experiments show that neural networks implement compositional subroutines in isolated, modular subnetworks, where ablating one piece damages only its corresponding function, and pretraining makes this modular structure more consistent and reliable (see "Do neural networks naturally learn modular compositional structure?"). Train for sparsity deliberately and the effect sharpens: sparse-weight transformers form compact circuits that ablation studies confirm are *necessary and sufficient* for the task, although scaling this beyond tens of millions of parameters remains unsolved (see "Can sparse weight training make neural networks interpretable by design?"). So the premise behind sparse fine-tuning, that a small set of weights carries the load, has direct support here.
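The ablation test behind these claims can be illustrated with a minimal sketch: zero out one group of weights and measure the damage to each task separately. Everything here is hypothetical, including the names `predict`, `ablate`, and `task_errors`, and the toy model is modular by construction (each output reads a disjoint weight group), so it shows what the measurement looks like rather than reproducing the pruning experiments.

```python
def predict(w, x):
    # Two task outputs computed from disjoint weight groups.
    task_a = w[0] * x[0] + w[1] * x[1]
    task_b = w[2] * x[2] + w[3] * x[3]
    return task_a, task_b

def ablate(w, indices):
    """Return a copy of w with the chosen weights set to zero."""
    return [0.0 if i in indices else wi for i, wi in enumerate(w)]

def task_errors(w, reference_w, inputs):
    """Mean squared deviation of each task output from the intact model."""
    err_a = err_b = 0.0
    for x in inputs:
        a, b = predict(w, x)
        ref_a, ref_b = predict(reference_w, x)
        err_a += (a - ref_a) ** 2
        err_b += (b - ref_b) ** 2
    return err_a / len(inputs), err_b / len(inputs)

weights = [0.8, -1.2, 2.0, 0.5]
inputs = [
    [1.0, 2.0, -1.0, 0.5],
    [0.5, -1.0, 2.0, 1.0],
    [-1.5, 0.25, 0.75, -2.0],
]

# Ablate the subnetwork assumed to serve task A.
err_a, err_b = task_errors(ablate(weights, {0, 1}), weights, inputs)
print("task A error:", err_a)
print("task B error:", err_b)
```

Task A degrades while task B is untouched, which is the signature of an isolated subnetwork. In a real network the weight groups are not known in advance and have to be discovered, for example by pruning.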
The most pointed evidence comes from work on multi-task fine-tuning: identifying each task's core parameter region, freezing those core parameters, and geometrically merging only the non-core ones consistently outperforms standard multi-task fine-tuning (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). That's the inversion worth noticing: here the win comes not from updating a sparse subnetwork but from *protecting* one while letting the rest move. The lesson is the same either way: structure matters more than parameter count, and updating everything indiscriminately causes interference that targeted approaches avoid.
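The protect-the-core idea can be sketched loosely as follows. This is not the method from the cited note, whose details are not given here: the core region is assumed to be the `k` parameters that moved most during a task's fine-tuning, and plain averaging stands in for the geometric merge. The names `core_region` and `merge_with_core_isolation` are hypothetical.

```python
def core_region(base, tuned, k):
    """Indices of the k parameters that moved most from the base model."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])

def merge_with_core_isolation(base, tuned_models, k):
    """Keep each task's core values intact; average everything else."""
    cores = [core_region(base, tuned, k) for tuned in tuned_models]
    merged = []
    for i in range(len(base)):
        owners = [t for t, core in zip(tuned_models, cores) if i in core]
        if len(owners) == 1:
            # Core for exactly one task: protect that task's value.
            merged.append(owners[0][i])
        else:
            # Non-core or contested: averaging is a stand-in assumption
            # for the geometric merge.
            merged.append(sum(t[i] for t in tuned_models) / len(tuned_models))
    return merged, cores

base = [0.0, 0.0, 0.0, 0.0]
task_a = [1.0, 0.1, 0.0, 0.2]   # task A mostly changed index 0
task_b = [0.1, 0.0, -2.0, 0.4]  # task B mostly changed index 2

merged, cores = merge_with_core_isolation(base, [task_a, task_b], k=1)
print("cores: ", cores)
print("merged:", merged)
```

Each task's dominant parameter survives the merge unchanged, while the remaining parameters are blended. A naive average of all parameters would instead halve both core values, which is one simple picture of the interference the note describes.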
The "freeze the backbone" theme recurs as a way to dodge catastrophic forgetting. SoftCoT keeps the main LLM entirely frozen and delegates soft thought generation to a small auxiliary model, preserving pretrained knowledge while still adding capability (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). And small models trained with DPO on a large teacher's correct and incorrect examples reach high accuracy on function calling by targeting exactly the rigid output-format failures where SFT alone underperforms (see "Can small models match large models on function calling?"). That result is about the training signal rather than about which weights are updated, but it makes the same point: precision beats breadth. There's even a hint about *why* sparse updates might suffice: networks already default to sparse representations for unfamiliar inputs and use dense ones only for well-learned data, so sparsity isn't a limitation imposed from outside but a structure the model grows into (see "Is representational sparsity learned or intrinsic to neural networks?").
The honest answer: the corpus strongly suggests "yes, often" — targeted, structure-aware updates can match or beat full fine-tuning, and frozen-backbone methods avoid the forgetting that full updates risk — but it stops short of a controlled apples-to-apples benchmark on sparse-subnetwork-only fine-tuning. What it leaves you with is the more interesting reframe: the open question isn't *whether* fewer parameters can do the job, but *which* ones, and whether you fine-tune them or protect them.
Sources (6 notes)
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.