How much does pretraining quality affect the modularity of fine-tuned models?
This reads 'modularity' as how cleanly a model's skills stay separable and composable after fine-tuning, and asks whether the pretrained base is what holds that structure together. The corpus points to a fairly direct answer: pretraining quality is largely what makes modularity possible in the first place, and fine-tuning mostly edits the seams rather than building the parts. The cleanest evidence comes from pruning studies showing that networks naturally implement compositional subroutines in isolated subnetworks and, crucially, that pretraining substantially increases how consistent and reliable that modular decomposition is across architectures and domains (Do neural networks naturally learn modular compositional structure?). Modularity isn't installed by the fine-tuning objective; it's inherited.
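The ablation criterion behind the pruning result (removing a subnetwork should change only the function it implements) can be illustrated with a toy example. This is a minimal sketch, assuming a hand-built linear network whose two subnetworks are disjoint by construction; it is not the pruning procedure from the cited study, and every name in it (`W_in`, `W_out`, `ablation_effect`) is hypothetical.

```python
import numpy as np

# Hand-built toy network (an assumption for illustration, not a trained model):
# hidden units 0-1 feed only output A (x0 + x1), units 2-3 feed only output B (x0 - x1).
W_in = 0.5 * np.array([[1.0, 1.0],
                       [1.0, 1.0],
                       [1.0, -1.0],
                       [1.0, -1.0]])
W_out = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])

def forward(X, mask):
    """Linear two-layer forward pass with a 0/1 mask over the hidden units."""
    hidden = X @ W_in.T
    return (hidden * mask) @ W_out.T

def ablation_effect(X, mask):
    """Mean absolute change in each output when the masked-out units are removed."""
    full = forward(X, np.ones(W_in.shape[0]))
    return np.abs(full - forward(X, mask)).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))

# Ablate one subnetwork at a time and see which outputs move.
effect_without_a = ablation_effect(X, np.array([0.0, 0.0, 1.0, 1.0]))
effect_without_b = ablation_effect(X, np.array([1.0, 1.0, 0.0, 0.0]))
print("ablate subnetwork A:", effect_without_a)
print("ablate subnetwork B:", effect_without_b)
```

In a modular network the effect vector is nonzero only for the ablated subnetwork's own output. In a real model the masks would have to be discovered, for example by pruning, rather than written down by hand.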
That inheritance is layered. One study decouples the two phases and finds that pretraining scale builds factual knowledge in the lower layers while fine-tuning scale adjusts behavioral helpfulness in the upper layers (Do pretraining and fine-tuning scale independently in language models?). So the 'modules', meaning the stored knowledge and latent capabilities, live in territory pretraining owns, and fine-tuning operates a layer up. This is why a strong base tolerates astonishingly light fine-tuning: LIMA shows that 1000 curated examples on a strong pretrained model are competitive with models trained on orders of magnitude more data, because post-training activates capabilities that already exist rather than building them (Can careful curation replace massive alignment datasets?). The same theme runs through reasoning: RL post-training teaches a model *when* to deploy reasoning, not *how*, because the strategies pre-exist in the base as latent activation patterns (Does RL post-training create reasoning or just deploy it?).
The sharper, less obvious lesson is what happens when fine-tuning reaches *down* into the pretrained layers: that is where modularity gets damaged. Direct weight fine-tuning corrupts knowledge storage in the lower layers, while decoding-time proxy-tuning preserves pretrained knowledge far better precisely because it leaves base weights untouched and only shifts reasoning and style (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). RL training can also collapse the format diversity a model inherited, amplifying a single format distribution from pretraining and suppressing the alternatives, a literal reduction in the base's compositional repertoire (Does RL training collapse format diversity in pretrained models?). And fine-tuning can hollow out the *connection* between modules: after fine-tuning, reasoning chains less reliably influence final answers, becoming performative rather than functional (Does fine-tuning disconnect reasoning steps from final answers?).
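The decoding-time idea can be sketched as logit arithmetic: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart, and no base weight is modified. This is a minimal sketch, assuming all three models share one vocabulary; the logit values and the names `base`, `expert`, and `antiexpert` are hypothetical stand-ins rather than outputs of real models.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def proxy_tuned_logprobs(base_logits, expert_logits, antiexpert_logits):
    """Shift the base logits by the (tuned - untuned) offset of a small model pair."""
    offset = expert_logits - antiexpert_logits
    return log_softmax(base_logits + offset)

# Hypothetical next-token logits over a 4-token vocabulary.
base = np.array([2.0, 1.0, 0.5, -1.0])       # large pretrained model, weights untouched
expert = np.array([0.0, 2.0, 0.0, 0.0])      # small model after tuning
antiexpert = np.array([0.0, 0.0, 0.0, 0.0])  # the same small model before tuning

probs = np.exp(proxy_tuned_logprobs(base, expert, antiexpert))
unchanged = np.exp(proxy_tuned_logprobs(base, antiexpert, antiexpert))
print("steered choice:", int(np.argmax(probs)))
print("base choice:   ", int(np.argmax(unchanged)))
```

When the tuned and untuned small models agree, the offset is zero and the base distribution passes through unchanged, which is the sense in which the pretrained knowledge is left intact.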
The most modularity-friendly methods all work by *not* overwriting the base. Transformer² tunes only the singular values of weight matrices to produce composable expert vectors that mix at inference without interfering with each other (Can models dynamically activate expert skills at inference time?). The implication for your question: pretraining quality sets the ceiling on modularity, and fine-tuning's job is to preserve and route those modules, not rebuild them. The fine-tuning approaches that fail at modularity are the ones that try to teach genuinely new procedures by force. They tend to sharpen memorization instead, collapsing on out-of-distribution variants because no real modular procedure was installed (Do fine-tuned language models actually learn optimization procedures?).
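The singular-value idea can be sketched on a single matrix: decompose the pretrained weight once, keep the singular vectors frozen, and let each skill be a vector of per-singular-value scales. This is a minimal sketch, assuming a random stand-in for a pretrained weight and hand-picked expert vectors and mixing weights; the names `z_math` and `z_code` are hypothetical, and how the vectors are trained and how the mixing weights are chosen are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # hypothetical stand-in for a pretrained weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

def adapt(z):
    """Rebuild the weight with singular values rescaled by z; U and Vt stay frozen."""
    return U @ np.diag(S * z) @ Vt

# Hypothetical expert vectors, one scale per singular value.
z_math = np.array([1.2, 1.0, 0.8, 1.0])
z_code = np.array([1.0, 0.9, 1.0, 1.3])

def mix(alphas, experts):
    """Weighted combination of expert vectors, applied at inference time."""
    return sum(a * z for a, z in zip(alphas, experts))

z_mixed = mix([0.7, 0.3], [z_math, z_code])
W_mixed = adapt(z_mixed)

print("trainable values per expert:", z_math.size, "of", W.size, "weights")
```

Because `adapt` is linear in `z`, mixing expert vectors is equivalent to mixing the adapted matrices, and a vector of ones recovers the original weight exactly. That is one way to read the claim that the experts compose without overwriting the base.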
Sources (9 notes)
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
LIMA demonstrates that fine-tuning a strong pretrained model on 1000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.