Do substitute networks converge differently than complement networks?

This reads 'substitute' vs 'complement' as two ways subnetworks can relate inside a model — modules that stand in for each other (redundant/substitutable) vs. modules that have to combine to do a job (complementary/compositional) — because that's the territory the corpus actually covers; if you meant substitute-vs-complement *product* graphs in recommendation, the collection only touches that edge obliquely.

This explores whether subnetworks that can swap in for one another behave differently from subnetworks that have to work together. The collection has surprisingly direct material on the second half of that and less on the first. The clearest finding is that neural networks naturally carve compositional tasks into isolated modular subnetworks, where ablating one damages only its own function and leaves the others intact (Do neural networks naturally learn modular compositional structure?). These are *complementary* subnetworks in your sense: each does a distinct job, and the whole only works when they compose. Forcing this structure through sparse weight training makes it even cleaner: you get compact circuits where individual neurons map to simple concepts and the connections between them are legible (Can sparse weight training make neural networks interpretable by design?).
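The ablation logic behind that finding can be sketched with a toy example. This is only an illustration under stated assumptions: the two hand-built "modules", the `ablate` helper, and the task f(x) = 2x + 3 are hypothetical and do not come from the cited notes, which work with pruned subnetworks of trained models.

```python
# Toy illustration only: two hand-built "subnetworks" that must compose.
# All names and the task itself are hypothetical.

def double(x, weights):
    # Module A: scales its input by one weight (2.0 when intact).
    return weights["double"] * x

def add_three(x, weights):
    # Module B: adds one bias (3.0 when intact).
    return x + weights["add_three"]

def model(x, weights):
    # The full task needs both modules: f(x) = 2x + 3.
    return add_three(double(x, weights), weights)

def ablate(weights, name):
    # Zero out one module's parameter and leave the other untouched.
    pruned = dict(weights)
    pruned[name] = 0.0
    return pruned

intact = {"double": 2.0, "add_three": 3.0}
no_double = ablate(intact, "double")

full_output = model(5.0, intact)                  # composed task, intact model
b_after_ablation = add_three(5.0, no_double)      # module B on its own
a_after_ablation = double(5.0, no_double)         # module A on its own
composed_after_ablation = model(5.0, no_double)   # composition with A ablated
```

In this sketch, ablating module A leaves module B's own function unchanged while breaking both A and the composed task, which is the signature the pruning experiments look for.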

The interesting wrinkle is that 'complementary' composition doesn't always converge to something robust. Transformers often fake it: instead of learning subnetworks that genuinely combine, they memorize the specific computation subgraphs seen in training and stitch them together by pattern-matching, which collapses the moment you ask for a novel combination (Do transformers actually learn systematic compositional reasoning?). The binding problem names why: networks struggle to keep separate pieces of information distinct and then recombine them on the fly, so complementary structure that looks modular can be brittle underneath (Why do neural networks fail at compositional generalization?). So 'do complement networks converge' has a real answer: they can, but whether they converge to true composition or to memorized stitching depends heavily on whether training covered the combination space (Can neural networks learn compositional skills without symbolic mechanisms?).
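One way to make "whether training covered the combination space" concrete is a held-out-combination split. The sketch below is a minimal, hypothetical setup: the skill names, the `split_by_combination` helper, and the coverage ratio are assumptions for illustration, not the protocol of any cited paper.

```python
from itertools import product

def split_by_combination(skills_a, skills_b, held_out):
    """Split (a, b) skill pairs into seen (train) and novel (test) combinations."""
    all_pairs = list(product(skills_a, skills_b))
    held = set(held_out)
    train = [p for p in all_pairs if p not in held]
    test = [p for p in all_pairs if p in held]
    return train, test

def covers_every_component(train, skills_a, skills_b):
    # Every individual skill should still appear in training, even though
    # some pairs never do. Otherwise the test measures missing components
    # rather than missing composition.
    seen_a = {a for a, _ in train}
    seen_b = {b for _, b in train}
    return seen_a == set(skills_a) and seen_b == set(skills_b)

# Hypothetical skill inventories.
skills_a = ["negate", "double", "square"]
skills_b = ["add_one", "halve", "abs"]
held_out = [("negate", "halve"), ("square", "abs")]

train_pairs, test_pairs = split_by_combination(skills_a, skills_b, held_out)
coverage = len(train_pairs) / (len(train_pairs) + len(test_pairs))
components_covered = covers_every_component(train_pairs, skills_a, skills_b)
```

A model that scores well on `train_pairs` but fails on `test_pairs` while `components_covered` is true is showing the memorized-stitching pattern rather than a missing-skill problem.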

The 'substitute' side, meaning redundant paths that can stand in for each other, shows up most directly in work on width scaling, where a reasoning system samples many parallel latent trajectories that are interchangeable routes to the same answer rather than complementary stages (Can reasoning systems scale wider instead of only deeper?). That's a different convergence regime: substitutable paths buy you robustness and coverage of the solution space without the variance blowup you'd fear, whereas complementary stages buy you the ability to do something none of them could alone. They're genuinely not the same dynamic.
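The intuition that interchangeable paths need not inflate variance can be checked with a small simulation. This is not the mechanism of the cited width-scaling work; it is the textbook fact that averaging K independent estimates of the same quantity shrinks variance roughly as 1/K. The Gaussian noise model, the function names, and the parameter values are all assumptions for illustration.

```python
import random
import statistics

def sample_path(rng, true_answer=1.0, noise=1.0):
    # One hypothetical trajectory: a noisy estimate of the same answer.
    return rng.gauss(true_answer, noise)

def aggregate(rng, width):
    # Average `width` interchangeable paths into a single prediction.
    return statistics.fmean(sample_path(rng) for _ in range(width))

def variance_at_width(width, trials=2000, seed=0):
    # Empirical variance of the aggregated prediction over many trials.
    rng = random.Random(seed)
    return statistics.pvariance([aggregate(rng, width) for _ in range(trials)])

var_narrow = variance_at_width(1)    # a single path
var_wide = variance_at_width(16)     # sixteen substitutable paths, averaged
```

Under these assumptions `var_wide` should come out well below `var_narrow`. The result depends on the paths being independent; correlated paths would shrink variance more slowly.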

Where the corpus gets thin is the literal recommendation-systems reading: substitute vs. complement *products* as graph structures. The nearest doorway is the finding that learned MLP similarities fail to match a tuned dot product in collaborative filtering, because the geometric inductive bias matters more than raw expressiveness (Can MLPs learn to match dot product similarity in practice?). That hints at the deeper point relevant to your question: the *relationship* you're modeling (interchangeable vs. co-purchased, substitute vs. complement) carries an inductive bias, and networks that bake in the right geometry converge faster and cleaner than ones asked to learn it from scratch.
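The contrast between the two similarity functions can be sketched in a few lines. This is a minimal illustration, not the setup from the cited note: the one-hidden-layer ReLU architecture, the function names, and the example weights are assumptions chosen to keep the code short.

```python
def dot_similarity(user_vec, item_vec):
    # Fixed geometric score: multiplies matching dimensions and sums them.
    # It has no parameters beyond the embeddings themselves.
    return sum(u * i for u, i in zip(user_vec, item_vec))

def mlp_similarity(user_vec, item_vec, hidden_weights, output_weights):
    # Learned score: concatenate the embeddings, apply one ReLU hidden layer,
    # then a linear output. The multiplicative user-item interaction is not
    # built in and would have to be approximated by the learned weights.
    x = list(user_vec) + list(item_vec)
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

# Hypothetical 2-dimensional embeddings and arbitrary example weights.
user = [1.0, 2.0]
item = [3.0, 4.0]
hidden_weights = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, -1.0]]
output_weights = [2.0, 5.0]

dot_score = dot_similarity(user, item)
mlp_score = mlp_similarity(user, item, hidden_weights, output_weights)
```

The dot product hard-codes the interaction between matching dimensions, while the MLP sees only a concatenated vector and must spend capacity and data learning that interaction.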

So the honest synthesis: the collection strongly supports that complementary (compositional) and substitutable (redundant/parallel) network structures converge under different pressures — composition is fragile and data-hungry, substitution is robust and coverage-driven — but it treats this through neural-network modularity, not through product substitute/complement graphs. If that product-graph reading was what you wanted, this is the boundary of what's here.

Sources (7 notes)

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and MLP-based scores do not support efficient retrieval at scale, which suggests that inductive bias matters more than expressiveness here.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a neural architecture researcher re-testing claims about convergence regimes in modular networks. The question remains open: do substitute networks (redundant, interchangeable paths) and complement networks (compositional, co-dependent stages) converge under fundamentally different pressures?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as perishable constraints.
- Complementary (compositional) subnetworks in transformers often converge to memorized pattern-matching rather than true compositional generalization, collapsing on novel combinations (~2023–2024).
- Sparse weight training produces cleaner modular circuits with interpretable neuron-to-concept mappings, suggesting complementary structure *can* be robust under the right inductive bias (~2025–2026).
- Substitutable parallel reasoning paths (width scaling via sampling latent trajectories) provide robustness and solution-space coverage without variance blowup, a different convergence dynamic from composition (~2024–2025).
- Test-time compute scaling and recurrent depth approaches show complementary stages benefit from iterative refinement; substitutable paths benefit from coverage rather than iteration (~2025).
- Binding problem: networks struggle to segregate and recombine distinct information on the fly, underpinning brittleness in complementary structures (~2020–2023).

Anchor papers (verify; mind their dates):
- arXiv:2305.18654 (2023) – Faith and Fate: compositionality limits in transformers
- arXiv:2511.13653 (2025) – Weight-sparse transformers have interpretable circuits
- arXiv:2501.18585 (2025) – Thoughts Are All Over the Place: o1-like reasoning and depth
- arXiv:2310.15123 (2023) – Branch-Solve-Merge and multi-agent orchestration

Your task:
(1) **RE-TEST EACH CONVERGENCE CLAIM.** For memorized-stitching vs. true composition in transformers: has recent work (e.g., o1-style test-time compute, orchestration frameworks) actually dissolved this boundary by allowing complementary stages to emerge *through iterative refinement* rather than one-pass forward learning? Separately: has sparse training made the substitutable vs. complementary distinction moot by forcing both into legible circuits? Cite what resolved or sharpened each constraint.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers claiming complementary stages now scale compositionally without data explosion, or showing substitutable paths can also be compositionally rich.
(3) **Propose 2 research questions** assuming the regime has shifted: (a) Do iterative refinement / test-time scaling regimes *erase* the memorized-stitching failure of complementary networks? (b) Do modern sparse architectures produce networks where the substitute/complement distinction is an artifact of how you probe them, not a structural difference?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
