Why do vision and language scale so differently?
IsoFLOP analysis reveals that vision and language follow distinct scaling curves: vision demands far more training data than language at an equivalent compute budget. Understanding this asymmetry matters for designing multimodal architectures that serve both modalities well.
The Chinchilla scaling laws characterize text pretraining: for a given compute budget, there is an optimal trade-off between model size and training tokens, and language models near this balance get the most out of their compute. The IsoFLOP analysis in Beyond Language Modeling shows that vision does not follow the same trade-off curve. At the same compute scale, vision is significantly more data-hungry than language: its optimal allocation shifts toward more training tokens and proportionally smaller models.
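As a rough illustration of what an IsoFLOP analysis does: fix a compute budget, train several model sizes under it, fit a parabola to loss against log model size, and read the compute-optimal size off the vertex. The sketch below is not code from either paper. The loss values are synthetic, the names `fit_parabola` and `isoflop_optimum` are hypothetical, and the token count assumes the common approximation C ≈ 6·N·D.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c, solved with Cramer's rule."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    normal = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    d = det3(normal)
    coeffs = []
    for col in range(3):
        m = [row[:] for row in normal]
        for r in range(3):
            m[r][col] = rhs[r]  # replace one column with the right-hand side
        coeffs.append(det3(m) / d)
    return tuple(coeffs)  # (a, b, c)

def isoflop_optimum(model_sizes, losses, compute_flops):
    """Estimate (optimal params, optimal tokens) for one fixed compute budget."""
    log_n = [math.log10(n) for n in model_sizes]
    mean = sum(log_n) / len(log_n)
    centered = [x - mean for x in log_n]  # centering keeps the fit well conditioned
    a, b, _ = fit_parabola(centered, losses)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    log_n_opt = mean - b / (2 * a)  # vertex of the parabola
    n_opt = 10 ** log_n_opt
    d_opt = compute_flops / (6 * n_opt)  # assumes C ~= 6 * N * D
    return n_opt, d_opt

# Synthetic IsoFLOP curve (made-up numbers): loss is minimized at 1e9 parameters.
sizes = [1e8, 10 ** 8.5, 1e9, 10 ** 9.5, 1e10]
synthetic_losses = [2.0 + 0.1 * (math.log10(n) - 9.0) ** 2 for n in sizes]
n_opt, d_opt = isoflop_optimum(sizes, synthetic_losses, compute_flops=1.2e20)
```

Repeating the fit at several budgets and regressing the optima against compute gives the scaling exponents. Under this kind of analysis, a more data-hungry modality would show its vertex at smaller model sizes for the same budget.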
This is a fundamental scaling asymmetry. In dense multimodal models, you have to pick a scaling regime: optimize for language (Chinchilla-balanced) and underfit vision, or optimize for vision (data-hungry) and waste capacity on language. There is no single dense allocation that serves both modalities well.
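To make the dilemma concrete, here is a minimal sketch assuming the common approximation C ≈ 6·N·D and a power-law optimum N_opt ∝ C^a. The language exponent of 0.5 follows the roughly equal split reported for Chinchilla. The vision exponent of 0.4 is a made-up placeholder chosen only to be more data-hungry, not a value from Beyond Language Modeling. The function name and the shared anchor point are likewise hypothetical.

```python
def optimal_allocation(compute, param_exponent, ref_compute=1.2e20, ref_params=1e9):
    """Return (params, tokens) for a compute budget.

    Assumes N_opt = ref_params * (compute / ref_compute) ** param_exponent
    and C ~= 6 * N * D. The anchor point is illustrative, not measured.
    """
    params = ref_params * (compute / ref_compute) ** param_exponent
    tokens = compute / (6.0 * params)
    return params, tokens

LANGUAGE_EXPONENT = 0.5  # roughly the Chinchilla result
VISION_EXPONENT = 0.4    # hypothetical placeholder, NOT a reported value

budget = 1.2e22  # FLOPs, 100x the anchor budget
lang_params, lang_tokens = optimal_allocation(budget, LANGUAGE_EXPONENT)
vis_params, vis_tokens = optimal_allocation(budget, VISION_EXPONENT)

print(f"language: {lang_params:.3g} params, {lang_tokens:.3g} tokens")
print(f"vision:   {vis_params:.3g} params, {vis_tokens:.3g} tokens")
```

With any vision exponent below the language one, the two optima drift further apart as the budget grows, which is why a single dense allocation cannot sit at both.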
Sparse mixture-of-experts (MoE) resolves the asymmetry by changing how each modality's tokens are processed. In the sparse regime, the analysis shows that language scaling itself shifts toward a more data-hungry pattern, aligning with vision's optimum. The mechanism is that sparsity provides the structural flexibility for modalities with fundamentally different scaling behaviors to coexist: each modality's tokens can route to experts trained on that modality's optimal data ratio.
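For readers unfamiliar with the routing step, the following is a minimal sketch of generic top-k gating, the usual way an MoE layer picks experts per token. It is not the router from Beyond Language Modeling: the weights are random, the dimensions are arbitrary, and `route_top_k` is a hypothetical name. Any modality specialization would have to emerge from training, which the sketch does not show.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    token: hidden vector for one token (list of floats).
    router_weights: one weight row per expert, same length as token.
    Gate weights are renormalized so the selected experts sum to 1.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
hidden_dim, num_experts = 8, 4  # arbitrary illustrative sizes
router_weights = [[random.gauss(0, 1) for _ in range(hidden_dim)]
                  for _ in range(num_experts)]
token = [random.gauss(0, 1) for _ in range(hidden_dim)]
assignment = route_top_k(token, router_weights, k=2)
```

Only the selected experts run for a given token, so vision and language tokens can end up exercising different subsets of the model's parameters.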
The finding has implications beyond multimodal architectures. It suggests that scaling laws are not properties of "training" in general but of the data distribution being trained on. Different data has different scaling exponents. The Chinchilla balance reflects the exponent for language; the exponent for vision is different, and the exponent for audio is presumably different again. Future multimodal architectures will need to accommodate this heterogeneity rather than assume a single scaling regime applies.
It also reframes what MoE is for. The standard story is computational efficiency: sparse activation reduces inference cost. The scaling-asymmetry story adds a complementary justification: MoE is the architecture that lets modalities with fundamentally different data appetites coexist in one model. This is a deeper role than efficiency, and it suggests sparsity should be a default rather than an optimization in multimodal pretraining.
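The efficiency story can be put in numbers with a small sketch. It assumes each expert is a plain two-matrix feed-forward block without biases and ignores the router's own parameters. The function name and the layer sizes are hypothetical, not taken from any model discussed here.

```python
def moe_ffn_params(hidden_dim, ffn_dim, num_experts, top_k):
    """Return (total, active_per_token) feed-forward parameter counts for one MoE layer.

    Assumes each expert has an up-projection and a down-projection only.
    """
    per_expert = 2 * hidden_dim * ffn_dim
    total = num_experts * per_expert
    active = top_k * per_expert
    return total, active

total_params, active_params = moe_ffn_params(
    hidden_dim=1024, ffn_dim=4096, num_experts=8, top_k=2
)
print(f"total: {total_params:,}  active per token: {active_params:,}")
```

In this configuration each token touches a quarter of the layer's feed-forward parameters, which is the gap between stored capacity and per-token compute that the efficiency argument rests on.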
Related concepts in this collection
- Can we solve modality competition through architectural design?
  Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.
  same paper, the architectural mechanism this scaling argument depends on
- Are text-only language models fundamentally limited by abstraction?
  Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.
  same paper, the motivation for going beyond language-only scaling
- Can deep learning theory unify around training dynamics?
  Is learning mechanics, focused on average-case predictions and training dynamics rather than worst-case bounds, the emerging framework that finally unifies fragmented deep learning theory?
  adjacent: scaling laws as one of the five strands of learning mechanics; modality-specific scaling exponents are a refinement
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
- Pixels, Patterns, but No Poetry: To See The World like Humans
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Continual Instruction Tuning for Large Multimodal Models
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
- Scaling Laws for Neural Language Models
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Original note title
vision and language have fundamentally different scaling exponents — sparsity bridges them by enabling data-hungry vision alongside Chinchilla-balanced language