Can fine-tuning sparse subnetworks alone match full-parameter fine-tuning results?

This explores whether you can fine-tune just a small, selected slice of a model's weights — a sparse subnetwork — and get results as good as updating every parameter; the corpus doesn't tackle this head-on, but several notes circle the same territory from different angles.

The corpus contains no clean head-to-head test of that exact claim (no lottery-ticket-style experiment). What it does offer is a circumstantial case built from three directions: that meaningful subnetworks exist, that touching only the right ones helps, and that freezing the rest often beats updating everything.
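To make the idea of updating "only a chosen subset of weights" concrete, here is a minimal sketch using a toy linear model and plain SGD. The model, the mask, the data, and the function names are all hypothetical illustrations; none of them come from the notes in the corpus.

```python
# Toy sketch: fine-tune only a masked subset of weights with plain SGD.
# The linear model, mask, and data are illustrative assumptions.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def masked_sgd_step(w, mask, x, y, lr=0.1):
    """One squared-error SGD step that updates only weights where mask is 1."""
    err = predict(w, x) - y
    # The gradient of 0.5 * err**2 with respect to w_i is err * x_i.
    # Multiplying by the mask zeroes the update for frozen weights.
    return [wi - lr * mi * err * xi for wi, mi, xi in zip(w, mask, x)]

pretrained = [0.5, -1.0, 2.0, 0.0]
mask = [0, 0, 1, 0]  # hypothetical subnetwork: only weight 2 is trainable
data = [([1.0, 0.0, 1.0, 0.0], 3.5), ([0.0, 1.0, 1.0, 0.0], 2.0)]

w = list(pretrained)
for _ in range(200):
    for x, y in data:
        w = masked_sgd_step(w, mask, x, y)
```

In this toy setup the task happens to be solvable by moving the single unmasked weight, so the masked run fits the data while every other weight keeps its pretrained value. Whether a real model has such a subset, and how to find it, is exactly what the rest of this note is about.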

Start with whether the subnetworks are even there to find. Pruning experiments show that neural networks naturally decompose tasks into isolated modular subnetworks, where ablating one piece damages only its corresponding function, and pretraining makes this modular structure more consistent and reliable (see "Do neural networks naturally learn modular compositional structure?"). Train for sparsity deliberately and the effect sharpens: sparse-weight transformers form compact circuits that ablation studies confirm are *necessary and sufficient* for the task (see "Can sparse weight training make neural networks interpretable by design?"). So the premise behind sparse fine-tuning, that a small set of weights carries the load, has direct support here.
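A rough sketch of what a "necessary and sufficient" ablation check looks like, under heavy simplification: the model is a toy linear function, the candidate circuit is a hand-picked set of weight indices, and the tolerance is arbitrary. All names and numbers below are assumptions for illustration, not the procedure used in the cited work.

```python
# Toy ablation check: is a candidate circuit necessary and sufficient?

def forward(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def ablate(weights, indices):
    """Return a copy of the weights with the given indices zeroed."""
    drop = set(indices)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def task_error(weights, data):
    """Worst-case absolute error over the task data."""
    return max(abs(forward(weights, x) - y) for x, y in data)

weights = [2.0, -3.0, 0.01, -0.02]   # hypothetical trained weights
data = [([1.0, 1.0, 1.0, 1.0], -1.0),
        ([2.0, 0.0, 1.0, 0.0], 4.0),
        ([0.0, 1.0, 0.0, 1.0], -3.0)]
circuit = [0, 1]                      # candidate circuit to test
rest = [i for i in range(len(weights)) if i not in circuit]
tol = 0.1                             # arbitrary "task still works" threshold

# Necessary: removing the circuit breaks the task.
necessary = task_error(ablate(weights, circuit), data) > tol
# Sufficient: removing everything except the circuit leaves the task intact.
sufficient = task_error(ablate(weights, rest), data) <= tol
```

Both checks are needed. A circuit can be necessary without being sufficient (other weights also matter), or sufficient without being necessary (a redundant copy exists elsewhere).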

The most pointed evidence comes from work on multi-task fine-tuning: identifying each task's core parameter region, freezing those core parameters, and geometrically merging only the non-core ones consistently outperforms standard multi-task fine-tuning (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). That is the inversion worth noticing: here the win comes not from updating a sparse subnetwork but from *protecting* one while letting the rest move. The lesson is the same either way: structure matters more than parameter count, and updating everything indiscriminately causes interference that targeted approaches avoid.
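The protect-then-merge pattern can be sketched on flat weight vectors. Two parts of this sketch are stand-ins chosen for simplicity and are not taken from the cited work: the core region is picked as the top-k weights by how far they moved during task fine-tuning, and the non-core weights are merged by plain averaging rather than the geometric merge the note describes. Task clustering is not modeled at all.

```python
# Toy sketch: keep each task's core weights, merge the rest.
# Core selection (top-k by change) and averaging are illustrative stand-ins.

def core_indices(base, tuned, k):
    """Indices of the k weights that moved most during task fine-tuning."""
    moved = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=lambda i: moved[i], reverse=True)
    return set(ranked[:k])

def merge_with_protected_cores(base, tuned_by_task, k):
    cores = {name: core_indices(base, tuned, k)
             for name, tuned in tuned_by_task.items()}
    merged = []
    for i in range(len(base)):
        owners = [name for name, core in cores.items() if i in core]
        if len(owners) == 1:
            # Protected: this weight belongs to exactly one task's core.
            merged.append(tuned_by_task[owners[0]][i])
        else:
            # Non-core (or contested) weight: average across tasks.
            values = [tuned[i] for tuned in tuned_by_task.values()]
            merged.append(sum(values) / len(values))
    return merged, cores

base = [0.0, 0.0, 0.0, 0.0]
tuned_by_task = {
    "A": [1.0, 0.1, 0.0, 0.2],    # hypothetical task-A fine-tuned weights
    "B": [0.1, 0.0, -2.0, 0.4],   # hypothetical task-B fine-tuned weights
}
merged, cores = merge_with_protected_cores(base, tuned_by_task, k=1)
```

Here task A keeps weight 0 and task B keeps weight 2 untouched, while the remaining weights are blended. A naive average of all four weights would instead halve both core values, which is the kind of interference the isolation is meant to prevent.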

The "freeze the backbone" theme recurs as a way to dodge catastrophic forgetting. SoftCoT keeps the main LLM entirely frozen and delegates soft thought generation to a small auxiliary model, preserving pretrained knowledge while still adding capability (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). And small models trained with DPO reach high function-calling accuracy by targeting exactly the rigid output-format failures where SFT alone underperforms (see "Can small models match large models on function calling?"). Again, precision beats breadth. There's even a hint about *why* sparse updates might suffice: networks already default to sparse representations for unfamiliar inputs and develop dense ones only for well-learned data, so sparsity isn't a limitation imposed from outside but a default the model already exhibits, though that finding concerns activations rather than weights (see "Is representational sparsity learned or intrinsic to neural networks?").
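The frozen-backbone pattern can be illustrated with a toy in which the backbone weights never change and only a small auxiliary vector is trained. This is a loose analogy only: the additive input offset below is a hypothetical stand-in and is not how SoftCoT's auxiliary model works.

```python
# Toy sketch: adapt to a new task by training a small auxiliary vector
# while the backbone stays frozen. All values are illustrative assumptions.

BACKBONE = (0.5, -1.0, 2.0)  # frozen weights; a tuple cannot be mutated

def backbone(x):
    return sum(w * xi for w, xi in zip(BACKBONE, x))

def model(x, aux):
    # The auxiliary vector shifts the backbone's input; the backbone
    # itself is never modified.
    return backbone([xi + ai for xi, ai in zip(x, aux)])

def train_aux(data, steps=200, lr=0.05):
    aux = [0.0, 0.0, 0.0]
    for _ in range(steps):
        for x, y in data:
            err = model(x, aux) - y
            # d(output)/d(aux_i) equals BACKBONE[i], so only aux is updated.
            aux = [ai - lr * err * w for ai, w in zip(aux, BACKBONE)]
    return aux

# Hypothetical new task: the backbone's original output plus 1.0.
data = [([1.0, 0.0, 0.0], 1.5), ([0.0, 1.0, 1.0], 2.0)]
error_before = max(abs(model(x, [0.0, 0.0, 0.0]) - y) for x, y in data)
aux = train_aux(data)
error_after = max(abs(model(x, aux) - y) for x, y in data)
```

Because the backbone is untouched, calling `backbone(x)` directly still returns the original pretrained behavior after training, which is the sense in which this pattern avoids forgetting.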

The honest answer: the corpus suggests "yes, often." Targeted, structure-aware updates can match or beat standard fine-tuning baselines, and frozen-backbone methods avoid the forgetting that full updates risk. But it stops short of a controlled apples-to-apples benchmark on sparse-subnetwork-only fine-tuning. What it leaves you with is the more interesting reframe: the open question isn't *whether* fewer parameters can do the job, but *which* ones, and whether you fine-tune them or protect them.

Sources 6 notes

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher tracking whether sparse-subnetwork fine-tuning can match full-parameter updates. The question remains open despite strong circumstantial evidence.

What a curated library found, and when. Findings span 2023–2026; treat them as time-stamped claims, not current ground truth:
• Pruning and sparse-weight training reveal modular, task-isolated subnetworks that are necessary and sufficient for their functions (2023–2024).
• Multi-task fine-tuning: protecting core task-specific parameters while merging non-core ones outperforms standard multi-task fine-tuning (~2025).
• Freezing the backbone (e.g., SoftCoT delegating reasoning to auxiliary modules) preserves pretrained knowledge while avoiding catastrophic forgetting (~2025).
• DPO-trained small models match large ones on function calling by targeting rigid format failures, suggesting precision beats parameter breadth (~2024).
• Networks learn sparse representations for unfamiliar inputs and dense ones only for well-learned data, implying sparsity is intrinsic, not externally imposed (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023) – Break It Down: Structural Compositionality
• arXiv:2508.21741 (2025) – Not All Parameters Are Created Equal: Smart Isolation
• arXiv:2511.13653 (2025) – Weight-sparse transformers have interpretable circuits
• arXiv:2512.12134 (2025) – SoftCoT: Soft Chain-of-Thought

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, ask: have newer model scales, pruning methods (magnitude, lottery-ticket variants), training harnesses (LoRA+sparsity masks, MoE fine-tuning), or evals (zero-shot generalization, cross-domain transfer) since relaxed or overturned it? Separate the durable question—*which* parameters to update and *how to identify* them—from perishable limitations like "small models can't do it" or "sparsity always costs accuracy". Where does each constraint still hold?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown full fine-tuning systematically beats sparse approaches, or shown sparse fine-tuning fails on large-scale tasks?
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Can learned sparsity masks from one task transfer to new tasks without recomputation?" or "Does layer-wise sparse fine-tuning outperform weight-level sparsity?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
