Do substitute networks converge differently than complement networks?
This answer reads 'substitute' vs. 'complement' as two ways subnetworks can relate inside a model: modules that stand in for each other (redundant, substitutable) versus modules that have to combine to do a job (complementary, compositional). That is the territory the corpus actually covers; if you meant substitute-vs-complement *product* graphs in recommendation, the collection touches that reading only obliquely.
The collection has surprisingly direct material on subnetworks that have to work together, and less on subnetworks that can swap in for one another. The clearest finding is that neural networks naturally carve compositional tasks into isolated modular subnetworks, where ablating one damages only its own function and leaves the others intact (see "Do neural networks naturally learn modular compositional structure?"). These are *complementary* subnetworks in your sense: each does a distinct job, and the whole works only when they compose. Forcing this structure through sparse weight training makes it even cleaner: you get compact circuits where individual neurons map to single concepts and the connections between them are legible (see "Can sparse weight training make neural networks interpretable by design?").
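The ablation logic described above can be sketched with a toy model. This is only an illustration under stated assumptions: the weights are set by hand rather than learned, the two "tasks" (computing |x0| and |x1|) are hypothetical, and the names `forward` and `per_task_error` are invented for this example. It shows what "ablating one module damages only its own function" looks like when the modules really are disjoint.

```python
import numpy as np

# Hypothetical toy network with hand-set (not learned) weights.
# Hidden units 0-1 form module A (computes |x0|),
# hidden units 2-3 form module B (computes |x1|).
W1 = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])  # (4 hidden, 2 inputs)
W2 = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])        # (2 outputs, 4 hidden)


def forward(x, mask=None):
    """Run the network; `mask` zeroes out (ablates) selected hidden units."""
    h = np.maximum(W1 @ x.T, 0.0)  # ReLU activations, shape (4, n)
    if mask is not None:
        h = h * mask[:, None]
    return (W2 @ h).T              # shape (n, 2)


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 2))
target = np.abs(x)                 # column 0 is task A, column 1 is task B


def per_task_error(mask=None):
    """Mean squared error per task, as an array [error_A, error_B]."""
    return np.mean((forward(x, mask) - target) ** 2, axis=0)


baseline = per_task_error()
ablate_a = per_task_error(np.array([0.0, 0.0, 1.0, 1.0]))  # remove module A
ablate_b = per_task_error(np.array([1.0, 1.0, 0.0, 0.0]))  # remove module B

print("baseline:", baseline)
print("ablate A:", ablate_a)
print("ablate B:", ablate_b)
```

In this constructed case each ablation raises the error on one task and leaves the other untouched. A trained network would only approximate this pattern, and the pruning experiments in the cited note are what establish how close real models come.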
The interesting wrinkle is that 'complementary' composition doesn't always converge to something robust. Transformers often fake it: instead of learning subnetworks that genuinely combine, they memorize the specific computation subgraphs seen in training and stitch them together by pattern-matching, which collapses the moment you ask for a novel combination (see "Do transformers actually learn systematic compositional reasoning?"). The binding problem names why: networks struggle to keep separate pieces of information distinct and then recombine them on the fly, so complementary structure that looks modular can be brittle underneath (see "Why do neural networks fail at compositional generalization?"). So 'do complement networks converge' has a real answer: they can, but whether they converge to true composition or to memorized stitching depends heavily on whether training covered the combination space (see "Can neural networks learn compositional skills without symbolic mechanisms?").
The 'substitute' side, meaning redundant paths that can stand in for each other, shows up most directly in work on width scaling, where a reasoning system samples many parallel latent trajectories that are interchangeable routes to the same answer rather than complementary stages (see "Can reasoning systems scale wider instead of only deeper?"). That's a different convergence regime: substitutable paths buy you robustness and coverage of the solution space without the variance blowup you'd fear, whereas complementary stages buy you the ability to do something none of them could do alone. The two are not the same dynamic.
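One way to make the contrast concrete is a back-of-the-envelope reliability model. This is a sketch, not something taken from the cited notes: it assumes each path or stage succeeds independently with the same probability `p`, which real networks will not satisfy exactly, and the function names are hypothetical. Under that assumption, `k` substitutable paths succeed if any one does, while `k` complementary stages succeed only if all do.

```python
def substitute_success(p: float, k: int) -> float:
    """Probability that at least one of k interchangeable paths succeeds.

    Assumes independent paths with equal success probability p.
    """
    return 1.0 - (1.0 - p) ** k


def complement_success(p: float, k: int) -> float:
    """Probability that all k required stages succeed.

    Assumes independent stages with equal success probability p.
    """
    return p ** k


if __name__ == "__main__":
    p = 0.6  # illustrative per-path / per-stage success rate
    for k in (1, 2, 4, 8):
        print(
            f"k={k}: substitutes={substitute_success(p, k):.4f}, "
            f"complements={complement_success(p, k):.4f}"
        )
```

Under these assumptions, adding substitutes pushes success toward 1 while adding required stages pushes it toward 0, which lines up with the robustness of parallel sampling and the compounding errors reported for multi-step composition. The model deliberately ignores what complementary stages gain: the ability to solve a task that no single stage can.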
Where the corpus gets thin is the literal recommendation-systems reading: substitute vs. complement *products* as graph structures. The nearest doorway is the finding that learned MLP similarities fail to match a tuned dot product in collaborative filtering, because the geometric inductive bias matters more than raw expressiveness (see "Can MLPs learn to match dot product similarity in practice?"). That hints at the deeper point relevant to your question: the *relationship* you're modeling (interchangeable vs. co-purchased, substitute vs. complement) carries an inductive bias, and networks that bake in the right geometry tend to converge faster and more cleanly than ones asked to learn it from scratch.
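To show what the two similarity functions look like side by side, here is a minimal sketch with random, untrained parameters. Everything in it is an assumption made for illustration: the embedding size, the single-hidden-layer MLP, and the names `dot_scores` and `mlp_scores` are hypothetical and are not the setup used in the cited work. It does not reproduce the accuracy comparison; it only shows the structural difference between a fixed geometric score and a learned one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, hidden = 1000, 16, 32   # illustrative sizes, chosen arbitrarily

user = rng.normal(size=d)            # one user embedding
items = rng.normal(size=(n_items, d))  # item embedding table


def dot_scores(user, items):
    """Dot-product similarity: one matrix-vector product scores every item."""
    return items @ user              # shape (n_items,)


# Random (untrained) MLP parameters, standing in for a learned similarity.
W1 = rng.normal(size=(2 * d, hidden)) / np.sqrt(2 * d)
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden) / np.sqrt(hidden)


def mlp_scores(user, items):
    """MLP similarity: each (user, item) pair is concatenated and passed
    through the network, so scoring requires a forward pass per pair."""
    pairs = np.concatenate([np.broadcast_to(user, items.shape), items], axis=1)
    h = np.maximum(pairs @ W1 + b1, 0.0)  # ReLU hidden layer, (n_items, hidden)
    return h @ w2                         # shape (n_items,)


top_dot = np.argsort(-dot_scores(user, items))[:5]
top_mlp = np.argsort(-mlp_scores(user, items))[:5]
print("top-5 by dot product:", top_dot)
print("top-5 by MLP:", top_mlp)
```

The dot product fixes the geometry up front, so ranking reduces to a maximum inner product search over the item table. The MLP has to learn any such geometry from data and offers no comparable shortcut for retrieval, which is the practical side of the inductive-bias point above.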
So the honest synthesis: the collection supports the view that complementary (compositional) and substitutable (redundant, parallel) network structures converge under different pressures, with stronger evidence on the complementary side. Composition is fragile and data-hungry; substitution is robust and coverage-driven. But the collection treats this through neural-network modularity, not through product substitute/complement graphs. If that product-graph reading was what you wanted, this is the boundary of what's here.
Sources (7 notes)
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and items scored by an MLP cannot be retrieved efficiently at scale, which suggests that inductive bias matters more than expressiveness.