Why do image captions create different friction than pure video data?
This explores why pairing images with text captions introduces a different kind of training problem than learning from raw video streams — and the corpus suggests the friction is about supervision and distribution, not just data volume.
The short version the corpus offers: captions force the model to reconcile two distributions that don't naturally line up, while pure video lets the model supervise itself by predicting hidden parts of the stream. So the friction with captions is competition for capacity, and the friction with video is the absence of labels.
The sharpest piece here is the finding that vision-language tension isn't inherent to mixing modalities; it comes from caption distributional shift. When you train on image-caption pairs, the caption text occupies a different statistical region than the model's general language, and a dense network with fixed capacity ends up making vision and language fight over the same parameters. The proposed fix is architectural: a Mixture-of-Experts allocates capacity per token so the two modalities coexist instead of crowding each other out (see "Can we solve modality competition through architectural design?"). So the "friction" of captions is really a capacity-allocation problem created by yoking a sparse, human-written description to a rich image.
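To make "allocates capacity per token" concrete, here is a minimal sketch of top-1 expert routing in plain Python. It is not the architecture from the cited note: the gate weights, the two toy experts, and the 2-d tokens are all hypothetical values chosen for illustration, and a real Mixture-of-Experts layer would learn the gate and typically add load balancing.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route_top1(token, gate_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    probs = softmax([dot(token, w) for w in gate_weights])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def moe_layer(tokens, gate_weights, experts):
    """Send each token only to its top-scoring expert, scaled by the gate probability."""
    outputs, assignments = [], []
    for token in tokens:
        idx, p = route_top1(token, gate_weights)
        outputs.append([p * v for v in experts[idx](token)])
        assignments.append(idx)
    return outputs, assignments

# Hypothetical toy setup (assumption, not from the source):
# one gate row per expert, two hand-written experts, three 2-d tokens.
GATE_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
EXPERTS = [
    lambda t: [2.0 * x for x in t],  # expert 0
    lambda t: [x + 1.0 for x in t],  # expert 1
]
TOKENS = [[3.0, 0.0], [0.0, 3.0], [2.5, 0.5]]  # 1st and 3rd "image-like", 2nd "caption-like"

outputs, assignments = moe_layer(TOKENS, GATE_WEIGHTS, EXPERTS)
print(assignments)
```

In this toy, the two "image-like" tokens land on one expert and the "caption-like" token on the other, so neither kind of token has to share those expert parameters with the other. A dense layer would push all three through the same weights.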
Pure video data sidesteps that yoke entirely. Instead of needing a paired caption to say what's happening, you can mask part of the stream and have the model predict it, learning temporal, task-aware structure directly from unlabeled recordings. This explicitly trades the bottleneck of labeled video for abundant unlabeled streams (see "Can unlabeled UI video teach models what users intend?"). The friction moves: with captions you fight distribution mismatch; with raw video you fight the absence of any text signal at all, which self-supervised prediction turns into an advantage rather than a cost.
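The masking idea can be sketched in a few lines. This is only an illustration of the objective, under loud assumptions: the "video" is a made-up list of 2-d feature vectors, and the predictor is a linear-interpolation stand-in rather than a learned network or the JEPA-style model the note describes. What it does show is that the training target comes from the stream itself, with no caption or label involved.

```python
import random

def mask_span(num_frames, span_len, rng):
    """Pick a contiguous block of interior frame indices to hide."""
    start = rng.randint(1, num_frames - span_len - 1)
    return list(range(start, start + span_len))

def predict_masked(frames, masked):
    """Stand-in predictor (assumption): interpolate between the visible neighbours."""
    left, right = masked[0] - 1, masked[-1] + 1
    gap = right - left
    preds = {}
    for i in masked:
        w = (i - left) / gap
        preds[i] = [(1 - w) * a + w * b for a, b in zip(frames[left], frames[right])]
    return preds

def masked_prediction_loss(frames, masked, preds):
    """Mean squared error on hidden frames only; targets are the hidden frames themselves."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(preds[i], frames[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

# Hypothetical "video": 8 frames of 2-d features moving along a straight line.
frames = [[float(t), 2.0 * t] for t in range(8)]
rng = random.Random(0)
masked = mask_span(len(frames), span_len=3, rng=rng)
loss = masked_prediction_loss(frames, masked, predict_masked(frames, masked))
print(masked, loss)
```

Because the toy motion is perfectly linear, interpolation recovers the hidden frames and the loss is zero. On real recordings the loss would be nonzero, and reducing it is what would push a learned predictor to pick up temporal structure.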
There's a deeper reason, worth knowing, why captions are lossy partners. Text is a compressed human abstraction that strips out the physics, geometry, and causality present in the raw signal, so a caption can never carry everything the image or video frames contain. That is exactly why text-grounded models inherit predictable blind spots in physical and causal reasoning (see "Are text-only language models fundamentally limited by abstraction?"). A caption is a bottleneck by construction; video is closer to the source dynamics. Interestingly, the opposite move also works: describing an image in natural language can bridge a gap that raw embedding similarity can't, as when a system describes an unknown image and retrieves matches from a text index (see "Can describing images in text improve zero-shot recognition?"). Captions are friction during training but leverage at inference.
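The describe-then-retrieve move can be sketched as follows. Everything specific here is an assumption for illustration: `describe_image` is a hypothetical stub standing in for a vision-language model, the index entries are invented, and word-overlap (Jaccard) scoring is a simple stand-in for whatever retrieval the cited system actually uses.

```python
def tokenize(text):
    """Lowercase bag of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Overlap between two word sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(description, text_index, top_k=1):
    """Rank indexed entries by word overlap with the generated description."""
    query = tokenize(description)
    ranked = sorted(
        text_index.items(),
        key=lambda kv: jaccard(query, tokenize(kv[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

def describe_image(image_id):
    """Hypothetical stub: a real system would call a vision-language model here."""
    canned = {"photo_001": "red octagon sign with white letters"}
    return canned[image_id]

# Invented reference designs, indexed by text only (no image model is trained).
TEXT_INDEX = {
    "stop": "red octagon with white stop letters",
    "yield": "white triangle with red border pointing down",
    "speed_limit": "white rectangle with black numbers",
}

match = retrieve(describe_image("photo_001"), TEXT_INDEX)
print(match)
```

The design point is that recognition becomes a text lookup: adding a new reference design means adding one line of description to the index, with no recognition model to retrain.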
The through-line: captions and video aren't just different data sizes, they're different *kinds* of supervision. Captions impose a second distribution the model must align with (friction = competition, fixable by giving each modality its own capacity); video offers a continuous signal the model can supervise itself on (friction = no labels, fixable by predictive masking). Knowing which friction you're paying for tells you whether to reach for a smarter architecture or a self-supervised objective.
Sources (4 notes)
Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.
UI-JEPA applies JEPA-style predictive masking to screen recordings, learning task-aware temporal representations that an LLM decoder can use to infer intent with minimal paired data. This trades the bottleneck of labeled video for abundant unlabeled streams.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.