How does weight sharing compound the advantages of deeper model designs?
This explores why reusing the same weights across layers pays off especially well in deep, thin networks — and the corpus only partly addresses it, with one note carrying most of the weight.
This reads the question as: when a model is built deep-and-thin rather than wide, why does sharing parameters across layers amplify the benefit rather than just save memory? The honest caveat up front: the collection has exactly one note that treats this head-on, so this is a synthesis built around it rather than across many converging sources.
The anchor is MobileLLM's finding that depth beats width below a billion parameters Does depth matter more than width for tiny language models?. The mechanism matters here: deep-and-thin architectures win because layers *compose* — each layer builds a more abstract concept on top of the one below, rather than spreading capacity sideways across a wide layer. That composition is what makes depth valuable. Weight sharing rides on top of this. If the real work of a deep model is the repeated, layered transformation of representations, then reusing the same block of weights across several of those layers lets you buy more layers of composition without paying for more parameters. Depth is the thing that helps; weight sharing is the trick that makes depth cheap. The two compound because the advantage you're multiplying (compositional depth) is precisely the one that doesn't strictly require fresh parameters at every step.
A lateral angle the corpus offers: layers aren't interchangeable, they specialize. Proxy-tuning work shows that lower layers store knowledge while upper layers handle reasoning and style — direct fine-tuning corrupts the lower-layer knowledge stores, while decoding-time tuning leaves them intact Can decoding-time tuning preserve knowledge better than weight fine-tuning?. This is a useful tension to sit with: weight sharing assumes a layer's transformation is reusable enough to apply more than once, but layer specialization says different depths do genuinely different jobs. The reconciliation is that sharing tends to be applied to *blocks* doing similar mid-network compositional work, not across the whole stack indiscriminately.
There's also a reason to expect sharing and depth to reinforce each other structurally. Work on sparse-weight training shows that forcing constraints on weights produces compact, modular, reusable circuits where neurons map to clean concepts Can sparse weight training make neural networks interpretable by design?. Sharing is a different constraint than sparsity, but it points the same direction: when you force the network to reuse machinery, you push it toward learning general, composable transformations rather than one-off layer-specific tricks — which is exactly what a deep compositional model wants. So the compounding isn't only about parameter economy; the sharing constraint may itself nudge the model toward the kind of reusable abstractions that depth is trying to exploit. Worth flagging for anyone going deeper here: the collection doesn't yet hold a paper isolating weight sharing as its own variable, so treat the link between sharing and depth as well-motivated but, in this corpus, inferred rather than directly measured.
Sources 3 notes
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.