How do sparse mixture-of-experts models resolve modality capacity competition?

This explores how Mixture-of-Experts (MoE) architectures let vision and language share a model without one starving the other for capacity — and why that competition turns out to be a fixable design choice rather than a law of multimodal learning.

This reads the question as: when a single model has to handle both images and text, why do they fight over the model's capacity, and how does giving the model many specialized 'experts' resolve that fight? The corpus has a direct answer and several adjacent ideas that reframe it. The central finding is that modality competition is *architectural, not inherent*: it comes from the distributional shift between caption text and ordinary text, combined with the rigidity of dense models that force every token through the same fixed capacity. A sparse MoE fixes this by routing capacity per token: a visual token and a language token can take different paths through different experts, so they coexist instead of competing for the same weights (see: Can we solve modality competition through architectural design?).
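
To make per-token routing concrete, here is a minimal sketch in plain Python. It assumes a generic top-k softmax router; the corpus note does not specify a routing rule, so this is an illustration rather than the method of any particular paper. The router weights are hand-set so that a toy 'visual' token and a toy 'language' token pick different experts, whereas a real router is learned. The names `route_token` and `moe_layer`, the two-feature tokens, and the two toy experts are all hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, top_k=1):
    """Return [(expert_index, gate_weight), ...] for a single token."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Keep the top_k most probable experts and renormalize their gates.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_weights, experts, top_k=1):
    """Gate-weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for idx, gate in route_token(token, router_weights, top_k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Hypothetical setup: feature 0 marks "visual" tokens, feature 1 marks "language" tokens.
router_weights = [[4.0, -4.0],   # router row for expert 0 (favors visual tokens)
                  [-4.0, 4.0]]   # router row for expert 1 (favors language tokens)
experts = [
    lambda t: [2.0 * x for x in t],   # toy expert 0: doubles its input
    lambda t: [-x for x in t],        # toy expert 1: negates its input
]

visual_token = [1.0, 0.0]
language_token = [0.0, 1.0]

visual_route = route_token(visual_token, router_weights)
language_route = route_token(language_token, router_weights)
visual_out = moe_layer(visual_token, router_weights, experts)
language_out = moe_layer(language_token, router_weights, experts)
```

With `top_k=1`, each token touches exactly one expert's weights, which is the sense in which the two modalities stop sharing one fixed pool.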

What makes this more than a one-paper claim is that the same logic, *let the network spend capacity only where it's needed*, shows up repeatedly under different names. Models naturally sparsify their activations when they hit unfamiliar or hard inputs, using sparsity as a selective filter that stabilizes performance rather than as a sign of breakdown (see: Do language models sparsify their activations under difficult tasks?). And that sparsity isn't bolted on: during ordinary pretraining, networks *learn* to keep dense representations for familiar data and fall back to sparse ones for unfamiliar inputs (see: Is representational sparsity learned or intrinsic to neural networks?). Modality competition is arguably a special case: text is the 'familiar' distribution, image captions are the unfamiliar one, and conditional routing is the mechanism that stops the familiar from crowding out the rest.

Sparsity also keeps paying off when you scale. At equal compute, larger sparse-attention models beat smaller dense ones on long-context work; sparsity expands the cost-performance frontier instead of trading quality for speed (see: Does sparse attention trade off quality for speed?). That's the same bet MoE makes for modalities: spending parameters conditionally lets you afford a bigger, more capacious model within the same budget, which is exactly the resource the competing modalities were fighting over.
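
A rough parameter count shows why conditional spending buys capacity on the MoE side of this analogy. The sketch below assumes a standard two-matrix feed-forward block, ignores biases, attention, and load-balancing overhead, and uses illustrative sizes (`d_model=1024`, `d_ff=4096`, 8 experts, top-1 routing) that are not taken from any of the notes. The function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices: d_model -> d_ff and d_ff -> d_model (biases ignored).
    return 2 * d_model * d_ff


def moe_param_counts(d_model, d_ff, num_experts, top_k):
    """Return (total_params, active_params_per_token) for one MoE feed-forward layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts           # one router row per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router     # only the selected experts run for a token
    return total, active


dense_params = ffn_params(1024, 4096)
moe_total, moe_active = moe_param_counts(1024, 4096, num_experts=8, top_k=1)

print(f"dense layer:  {dense_params:,} parameters, all active for every token")
print(f"MoE layer:    {moe_total:,} parameters, {moe_active:,} active per token")
print(f"capacity ratio at near-equal per-token cost: {moe_total / moe_active:.1f}x")
```

Under these assumptions the MoE layer holds roughly eight times the parameters of the dense layer while each token still pays for about one expert's worth of computation.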

The lateral surprise is that routing capacity isn't the only axis, and MoE alone isn't always the ceiling. Pairing cheap O(1) lookup memory with MoE routing beats pure MoE at the same parameter count and FLOPs, with a U-shaped sweet spot where balancing both mechanisms wins; the biggest gains come in reasoning and code, not retrieval (see: Can lookup memory and computation work together better than either alone?). And experts don't have to be fixed at training time: models can compose task-specific expert vectors at inference by tuning only singular values (see: Can models dynamically activate expert skills at inference time?), or even discover entirely new experts through gradient-free search in weight space (see: Can language models discover new expertise through collaborative weight search?). So 'resolving modality competition' generalizes into a broader principle: partition capacity conditionally, and you can host competing demands without forcing them to share one rigid pool.
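
As a rough illustration of what gradient-free search in weight space can look like, the sketch below runs a textbook particle swarm update on a toy problem. It is not the procedure from the cited note: the 'weights' are a three-number vector, the objective is a made-up squared distance to a hidden target standing in for a validation score, and every name and hyperparameter (`swarm_search`, `inertia`, `c_personal`, `c_global`) is an assumption chosen for the example.

```python
import random


def loss(weights, target):
    # Toy stand-in for a validation metric: squared distance to a hidden target.
    return sum((w - t) ** 2 for w, t in zip(weights, target))


def swarm_search(target, num_particles=8, steps=50, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Textbook particle swarm optimization; uses loss values only, no gradients."""
    rng = random.Random(seed)
    dim = len(target)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(num_particles)]
    velocities = [[0.0] * dim for _ in range(num_particles)]
    personal_best = [p[:] for p in positions]
    personal_loss = [loss(p, target) for p in positions]
    best_idx = min(range(num_particles), key=lambda i: personal_loss[i])
    global_best = personal_best[best_idx][:]
    global_loss = personal_loss[best_idx]
    initial_loss = global_loss

    for _ in range(steps):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull each particle toward its own best and the swarm's best position.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            current = loss(positions[i], target)
            if current < personal_loss[i]:
                personal_loss[i] = current
                personal_best[i] = positions[i][:]
                if current < global_loss:
                    global_loss = current
                    global_best = positions[i][:]
    return global_best, initial_loss, global_loss


hidden_target = [0.5, -0.25, 0.1]
best_weights, start_loss, end_loss = swarm_search(hidden_target)
print(f"best loss before search: {start_loss:.6f}, after search: {end_loss:.6f}")
```

The search only ever compares loss values, so the same loop shape would apply to any objective that can be scored but not differentiated.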

One caution the corpus adds: more capacity aimed at the wrong bottleneck doesn't help. For multimodal *perception*, verbose chain-of-thought and text-token reasoning actually degrade performance because the real bottleneck is visual attention allocation, not verbal capacity (see: Does verbose chain-of-thought actually help multimodal perception tasks?). The lesson pairs neatly with the MoE finding: solving modality competition is about giving capacity the right *shape and destination*, not simply giving each modality more.

Sources (8 notes)

Can we solve modality competition through architectural design?

Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.

Can models dynamically activate expert skills at inference time?

Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
