Why do vision and language have different optimal scaling curves?

This explores why image models and text models follow different rules for how much data and compute they each need to keep improving — and what the corpus says about reconciling them in one model.

The cleanest answer in the corpus is that the two modalities sit in different regimes. IsoFLOP analysis shows language scaling close to the Chinchilla balance (roughly proportional growth of data and parameters), while vision is far more data-hungry: it keeps wanting more images relative to its size (see "Why do vision and language scale so differently?"). The practical fix that work proposes is sparse mixture-of-experts. Routing tokens to modality-specific experts effectively shifts language toward vision's data-hungry regime, letting both coexist optimally inside a single model instead of forcing one compromise curve on both.
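To make "different regimes" concrete, here is a minimal sketch of how a compute budget splits between parameters and data under a simple power law. It assumes the common approximation that training FLOPs are about 6 × parameters × tokens, and that the optimal parameter count grows as a power of compute. The function name `compute_optimal_split`, the coefficient, and the exponent 0.4 used for the data-hungry case are illustrative assumptions, not values from the cited note; only the trend is meaningful.

```python
def compute_optimal_split(flops, param_exponent, coeff=1.0):
    """Split a FLOP budget into (parameters, training tokens).

    Assumes C ~= 6 * N * D and N_opt = coeff * (C / 6) ** param_exponent.
    An exponent of 0.5 is the balanced, Chinchilla-like case; a smaller
    exponent sends more of the budget to data (a data-hungry regime).
    coeff is a hypothetical fit constant, so absolute values are arbitrary.
    """
    if not 0.0 < param_exponent < 1.0:
        raise ValueError("param_exponent must lie strictly between 0 and 1")
    budget = flops / 6.0                      # N * D implied by the budget
    n_params = coeff * budget ** param_exponent
    n_tokens = budget / n_params              # whatever remains goes to data
    return n_params, n_tokens


# Illustrative exponents only: 0.5 (balanced) versus 0.4 (data-hungry).
for label, exponent in [("balanced", 0.5), ("data-hungry", 0.4)]:
    for flops in (1e18, 1e20, 1e22):
        n, d = compute_optimal_split(flops, exponent)
        print(f"{label:12s} C={flops:.0e}  tokens/param={d / n:.3g}")
```

With a balanced exponent the tokens-per-parameter ratio stays constant as compute grows; with a smaller exponent it keeps rising, which is one way to read "vision keeps wanting more images relative to its size".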

Why would the appetites differ in the first place? A useful adjacent framing is that text is a lossy compression of reality: it strips out the physics, geometry, and causal structure that images still carry (see "Are text-only language models fundamentally limited by abstraction?"). Language arrives pre-abstracted by humans, so a model extracts a lot per token; vision carries raw, redundant, high-dimensional signal, so it needs far more examples to distill comparable structure. The scaling exponent gap isn't an accident of architecture. It tracks how much each medium has already been pre-digested before the model ever sees it.

The corpus also pushes back on the idea that there's one universal scaling curve at all. Below a billion parameters, depth beats width for language: composing concepts through more layers outperforms spreading parameters sideways, contradicting the Kaplan-style expectation that model shape matters little at a fixed parameter count (see "Does depth matter more than width for tiny language models?"). And once you fold architectural variables (hidden size, MLP-to-attention ratio, GQA) into the scaling law itself, the "optimal" point moves. There's no single curve; there's a curve conditional on the shape you chose (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). So "different optimal curves" is partly a story about the modality and partly about the fact that scaling laws are local, not universal.
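The depth-versus-width point is easier to see with a rough parameter count. The sketch below uses the standard back-of-envelope estimate for a transformer block (four attention projections plus a two-matrix MLP, ignoring embeddings, biases, and norms) to show that two very different shapes can land on nearly the same budget. The function name `approx_block_params` and both configurations are hypothetical examples chosen for illustration. The code only counts parameters; it says nothing about which shape trains better.

```python
def approx_block_params(n_layers, d_model, ffn_mult=4):
    """Rough non-embedding parameter count for a decoder-only transformer.

    Per layer: 4 * d^2 for the Q, K, V, and output projections, plus
    2 * ffn_mult * d^2 for a two-matrix MLP. With ffn_mult=4 this is the
    familiar ~12 * n_layers * d_model^2 estimate.
    """
    attention = 4 * d_model * d_model
    mlp = 2 * ffn_mult * d_model * d_model
    return n_layers * (attention + mlp)


# Two hypothetical shapes with nearly equal budgets.
deep_thin = approx_block_params(n_layers=30, d_model=576)
shallow_wide = approx_block_params(n_layers=12, d_model=912)

print(f"deep-and-thin:    {deep_thin / 1e6:.1f}M parameters")
print(f"shallow-and-wide: {shallow_wide / 1e6:.1f}M parameters")
```

A scaling law that takes only the total parameter count as input cannot tell these two models apart, which is why folding shape variables into the law moves the optimum.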

There's a sharper twist when the two modalities are forced together. Verbose chain-of-thought, which reliably helps language reasoning, actively degrades fine-grained multimodal perception, because the real bottleneck there is visual attention allocation, not more text tokens (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). That's the scaling-curve divergence showing up at the optimization level: pouring more of language's favored resource into a vision task optimizes the wrong target. The lesson running through all of these is that "scale" isn't one knob. Each modality has its own bottleneck, and the curve you should follow depends on which bottleneck you're actually fighting.

Sources (5 notes)

Why do vision and language scale so differently?

IsoFLOP analysis shows language scales near Chinchilla balance while vision is significantly more data-hungry. Sparse MoE shifts language toward the data-hungry regime, enabling both modalities to coexist optimally in one model by routing tokens to modality-specific experts.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
