What prevents representation collapse in latent-prediction world models like JEPA?
This explores why latent-prediction world models like JEPA — which learn by predicting their own internal embeddings rather than raw pixels or tokens — don't simply cheat by collapsing every input to a single constant, and what design choices actually hold that collapse off.
The central failure mode of any model that predicts its own latents is simple: if the encoder is free to map everything to the same point, the prediction loss goes to zero and the model has learned nothing. The corpus suggests the answer is less about clever architecture than about one well-chosen pressure that forces the representation to stay spread out. The clearest demonstration is LeWorldModel, a JEPA trained end-to-end from raw pixels using nothing but next-embedding prediction plus a single Gaussian-latent regularizer, a constraint that pushes the latent distribution to stay broad rather than degenerate. This reduces the tunable hyperparameters from six to one while still planning 48× faster than heavier foundation-model world models (see "Can a single regularizer prevent JEPA representation collapse?"). The lesson is that collapse isn't prevented by forbidding the trivial solution directly; it's prevented by making that solution statistically expensive.
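As a rough illustration of why a distributional penalty makes the constant solution expensive, here is a minimal NumPy sketch. It is not LeWorldModel's actual regularizer: `gaussian_moment_penalty` is a hypothetical moment-matching stand-in that only checks the batch mean and covariance against a standard Gaussian, and the function names, batch size, and `reg_weight` are assumptions made for the example.

```python
import numpy as np


def prediction_loss(pred, target):
    """Mean squared error between predicted and actual next embeddings."""
    return float(np.mean((pred - target) ** 2))


def gaussian_moment_penalty(z):
    """Zero when the batch has zero mean and identity covariance.

    Assumption: a simple moment-matching stand-in for a Gaussian-latent
    regularizer, not the one used by any specific paper.
    """
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)
    identity = np.eye(z.shape[1])
    return float(np.sum(mean ** 2) + np.sum((cov - identity) ** 2))


def total_loss(pred, target, z, reg_weight=1.0):
    """Prediction term plus the weighted distributional penalty."""
    return prediction_loss(pred, target) + reg_weight * gaussian_moment_penalty(z)


rng = np.random.default_rng(0)
batch, dim = 4096, 8

# Collapsed encoder: every input maps to the same point, so predicting
# the next embedding is trivially perfect.
collapsed = np.full((batch, dim), 0.5)
collapsed_pred_loss = prediction_loss(collapsed, collapsed)
collapsed_penalty = gaussian_moment_penalty(collapsed)

# Spread-out encoder: embeddings look like draws from a standard Gaussian.
spread = rng.standard_normal((batch, dim))
spread_penalty = gaussian_moment_penalty(spread)

print("collapsed: prediction loss", collapsed_pred_loss, "penalty", collapsed_penalty)
print("spread:    penalty", spread_penalty)
```

The collapsed batch gets a perfect prediction loss but a large penalty, because its covariance is all zeros instead of the identity. The spread batch pays almost nothing. With the penalty added to the objective, the trivial solution is no longer the cheapest one.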
There's a deeper reason this kind of self-prediction is worth saving from collapse in the first place. A formal sample-complexity argument shows that predicting latents recovers compositional, hierarchical structure exponentially faster than predicting raw tokens. Embeddings at the same level of abstraction are far more correlated with each other than raw inputs are, so the model needs only a constant number of samples per layer of hierarchy instead of an exponential blowup (see "Why is predicting latents more sample-efficient than tokens?"). That advantage depends on latents that still carry information, which is exactly what a collapse would destroy: the regularizer's job is to keep the latent space rich enough that the sample-efficiency advantage survives.
The subtler danger the corpus surfaces is that collapse has quiet cousins that no loss curve will flag. A representation can pass every linear-probe and accuracy test while being internally fractured: all the decodable features are present, but they are organized so badly that the model shatters under perturbation or distribution shift (see "Can models be smart without organized internal structure?"). This reframes the JEPA collapse problem: a regularizer that prevents full collapse doesn't guarantee a well-structured latent, and standard metrics won't tell you the difference. Relatedly, hidden states can shift their geometry adaptively. Language models sparsify their activations under out-of-distribution stress, and this acts as a stabilizing filter rather than a breakdown (see "Do language models sparsify their activations under difficult tasks?"). It is a reminder that not every change in representation density is degeneration; some of it is the network protecting itself.
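To make the gap between "not collapsed" and "well-structured" concrete, here is a small sketch of one common collapse diagnostic, the participation ratio of the embedding covariance. It is an illustrative assumption rather than a metric taken from the notes, and `participation_ratio` and its tolerance are hypothetical names and values. It estimates how many dimensions the latents actually use, so it catches full and partial (low-rank) collapse, but it says nothing about whether the surviving dimensions are organized sensibly.

```python
import numpy as np


def participation_ratio(z, tol=1e-12):
    """Effective dimensionality: (sum of eigenvalues)^2 / sum of squared eigenvalues.

    Returns 0.0 when the batch has (numerically) no variance at all.
    """
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (z.shape[0] - 1)
    # Clip tiny negative eigenvalues caused by floating-point error.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eig.sum()
    if total < tol:
        return 0.0
    return float(total ** 2 / np.sum(eig ** 2))


rng = np.random.default_rng(0)
n, dim = 4096, 16

# Healthy: all 16 dimensions carry independent variance.
healthy = rng.standard_normal((n, dim))

# Partial collapse: 16 output dimensions, but only 2 underlying directions.
low_rank = rng.standard_normal((n, 2)) @ rng.standard_normal((2, dim))

# Full collapse: every input maps to the same point.
constant = np.full((n, dim), 0.5)

pr_healthy = participation_ratio(healthy)
pr_low_rank = participation_ratio(low_rank)
pr_constant = participation_ratio(constant)

print("healthy:", pr_healthy)
print("low rank:", pr_low_rank)
print("constant:", pr_constant)
```

A latent space can score near the full dimension on this measure and still be fractured in the sense described above, which is why a spread-based check is only a floor and not evidence of good structure.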
What would a healthy, non-collapsed latent space look like? The corpus offers a target: networks that decompose tasks into modular subnetworks, where ablating one piece cleanly removes one function, and where pretraining sharpens that modularity (see "Do neural networks naturally learn modular compositional structure?"). Latent representations can also become a genuine scaling axis in their own right: latent-thought models add capacity by growing the latent rather than the parameter count, coupling fast local learning of the latent with slow global learning of the decoder (see "Can latent thought vectors scale language models beyond parameters?"). Both are the opposite of collapse: structure that's organized, separable, and expandable.
The thing you didn't know you wanted to know: the hard problem in latent world models was never "how do we predict embeddings" — it was "how do we stop the model from making its own target trivial." The encouraging finding is that a single distributional constraint can do most of that work. The cautionary finding is that surviving collapse and being well-structured are two different bars, and the gap between them is invisible to the metrics most people watch.
Sources (6 notes)
Can a single regularizer prevent JEPA representation collapse? LeWorldModel trains a JEPA end-to-end using only next-embedding prediction and a Gaussian-latent regularizer, reducing tunable hyperparameters from six to one. The model achieves competitive control performance and 48× faster planning than foundation-model world models on a single GPU.
Why is predicting latents more sample-efficient than tokens? A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples, because same-level latents are far more correlated than raw tokens.
Can models be smart without organized internal structure? Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.
Do language models sparsify their activations under difficult tasks? As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Do neural networks naturally learn modular compositional structure? Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Can latent thought vectors scale language models beyond parameters? Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.