How do latents at the same hierarchy level become more correlated than tokens?
This explores why representations learned at a given level of abstraction (latents) share more statistical structure with each other than the raw tokens they were built from — and what that buys a model that learns to predict its own latents instead of the next token.
The cleanest answer comes from a sample-complexity argument. When data is generated by a compositional hierarchy, the tokens at the bottom are the noisy, high-variance leaves: many different surface tokens can express the same underlying chunk, so any two tokens share relatively little mutual information. The latents one level up are the shared causes behind those tokens. Because they sit closer to the generative structure, sibling latents at the same level co-vary tightly. Predicting your own latents (the data2vec/JEPA family) therefore recovers the hierarchy with a number of samples that stays constant in depth, while token-level next-word prediction needs exponentially more samples, precisely because same-level latents are far more correlated than raw tokens (see "Why is predicting latents more sample-efficient than tokens?").
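A toy simulation can make the signal-to-noise intuition concrete. The sketch below is not the model used in the formal analysis; it assumes a hypothetical two-level hierarchy in which a binary parent generates two sibling latents (each a copy of the parent, flipped with probability `flip`), and each latent is expressed by one of `n_synonyms` interchangeable surface tokens chosen uniformly. All function names and parameter values are illustrative assumptions. It compares the correlation between the sibling latents with the correlation between the indicators of one specific token at each position.

```python
import numpy as np

def sibling_correlations(n_samples=200_000, n_synonyms=8, flip=0.1, seed=0):
    """Toy hierarchy (assumed, illustrative): parent -> two sibling latents -> tokens."""
    rng = np.random.default_rng(seed)

    # Shared cause one level up.
    parent = rng.integers(0, 2, size=n_samples)

    # Each sibling latent copies the parent, flipped with probability `flip`.
    z1 = parent ^ (rng.random(n_samples) < flip).astype(int)
    z2 = parent ^ (rng.random(n_samples) < flip).astype(int)

    # Each latent value owns `n_synonyms` surface tokens; one is picked at random.
    t1 = z1 * n_synonyms + rng.integers(0, n_synonyms, size=n_samples)
    t2 = z2 * n_synonyms + rng.integers(0, n_synonyms, size=n_samples)

    # Latent-level statistic: correlation between the two sibling latents.
    corr_latent = np.corrcoef(z1, z2)[0, 1]

    # Token-level statistic: correlation between "position 1 shows token a"
    # and "position 2 shows token a", for one synonym of latent value 1.
    token_a = n_synonyms
    ind1 = (t1 == token_a).astype(float)
    ind2 = (t2 == token_a).astype(float)
    corr_token = np.corrcoef(ind1, ind2)[0, 1]

    return corr_latent, corr_token

if __name__ == "__main__":
    lat, tok = sibling_correlations()
    print(f"latent-latent correlation: {lat:.3f}")
    print(f"token-token correlation:   {tok:.3f}")
```

Under these assumptions the token-level correlation shrinks roughly in proportion to the number of synonyms, while the estimation noise at a fixed sample size does not, so the token-level signal takes more samples to detect. Stacking more levels compounds the effect, which is the intuition behind the depth dependence described above.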
Where does that correlation structure come from in the first place? You don't need a hierarchy-building mechanism; it falls out of corpus statistics. Spectral analysis of word co-occurrence matrices reproduces the same nested geometry found in trained embeddings, meaning the hierarchy is latent in the data and the model merely reads it off (see "Where does hierarchical structure in language models come from?"). The leading eigenvectors of embedding Gram matrices split taxonomy coarse-to-fine, broad branches first and finer sub-branches after, mirroring the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). Each eigenvector level is a coherent band of correlated directions; that banding is the same-level correlation the sample-efficiency result exploits.
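The coarse-to-fine spectral order can be reproduced on a synthetic matrix. The sketch below does not use real co-occurrence data or trained embeddings; it assumes a hypothetical balanced binary tree over eight leaves and a similarity matrix in which two leaves gain one unit of similarity for every ancestor they share. The functions `tree_similarity` and `spectral_levels` are illustrative names, not part of any library.

```python
import numpy as np

def tree_similarity(depth=3, weights=(1.0, 1.0, 1.0, 1.0)):
    """Similarity between leaves of a balanced binary tree (toy assumption).

    Leaves i and j gain weights[k] for every level-k ancestor they share,
    where level 0 is the root and level `depth` is the leaf itself.
    """
    n = 2 ** depth
    idx = np.arange(n)
    G = np.zeros((n, n))
    for level, w in enumerate(weights):
        # Two leaves share a level-`level` ancestor when their top bits agree.
        ancestor = idx >> (depth - level)
        G += w * (ancestor[:, None] == ancestor[None, :])
    return G

def spectral_levels(G):
    """Eigenvalues and eigenvectors of a symmetric matrix, largest first."""
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

if __name__ == "__main__":
    vals, vecs = spectral_levels(tree_similarity())
    print("eigenvalues:", np.round(vals, 3))
    # The second eigenvector separates leaves 0-3 from leaves 4-7.
    print("signs of 2nd eigenvector:", np.sign(vecs[:, 1]).astype(int))
```

With unit weights the eigenvalues are 15, 7, 3 (twice), and 1 (four times). The top eigenvector is the uniform direction, the next separates the two broad branches, the pair after that separates sub-branches within each branch, and the smallest ones distinguish individual leaves. Real corpora are far less tidy, so this only illustrates why a tree-structured similarity matrix orders its spectrum by level.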
The practical payoff shows up when models are built to reason in latent space rather than token space. Meta's Large Concept Model operates on sentence embeddings instead of tokens, which lets it plan in a language-agnostic abstraction layer and produce more coherent output than flat token generation, because the units it manipulates already carry the shared structure (see "Can reasoning happen at the sentence level instead of tokens?"). Latent-thought models go further, scaling reasoning along a latent dimension that is independent of parameter count (see "Can latent thought vectors scale language models beyond parameters?"). That is possible only because latent units are information-dense and correlated enough that adding them along that axis pays off.
There's a nice mirror image on the token side. Even within token sequences, not all tokens are equal: a small minority of high-entropy "forking" tokens carries most of the learning signal in reasoning models, and training only on that roughly 20% matches or exceeds full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). Likelihood-preserving pruning of reasoning chains shows a similar ranking by functional importance: symbolic-computation tokens are preferentially preserved, while grammar and meta-discourse are pruned first (see "Which tokens in reasoning chains actually matter most?"). Read together, these say the same thing from the bottom up: most tokens are redundant low-information leaves, and the real structure lives at the pivot points and the level above. Latent prediction wins because it targets that structure directly instead of paying the token tax to rediscover it.
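Selecting the high-entropy minority can be sketched in a few lines. The source notes do not say how the cutoff is computed, so the version below assumes a simple rule: compute the entropy of the next-token distribution at each position and keep the top fraction of positions. The helper names `token_entropy` and `forking_mask` and the `keep_frac` parameter are hypothetical.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    `logits` has shape (sequence_length, vocab_size).
    """
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def forking_mask(entropy, keep_frac=0.2):
    """Boolean mask marking the `keep_frac` highest-entropy positions.

    Assumed selection rule: a fixed top fraction per sequence.
    """
    entropy = np.asarray(entropy, dtype=float)
    k = max(1, int(round(keep_frac * entropy.size)))
    top = np.argsort(entropy)[-k:]
    mask = np.zeros(entropy.shape, dtype=bool)
    mask[top] = True
    return mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random logits for 10 positions over a 50-token vocabulary (made-up data).
    logits = rng.normal(size=(10, 50)) * rng.uniform(0.5, 5.0, size=(10, 1))
    mask = forking_mask(token_entropy(logits))
    print("positions kept for the update:", np.flatnonzero(mask))
```

In a training loop, a mask like this would multiply the per-token loss so that only the selected positions contribute gradients. Whether the fraction should be fixed per sequence or per batch is a design choice this sketch leaves open.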
The surprise worth carrying away: the correlation isn't a quirk of one architecture but a property of the data's generative geometry. That is also why a model that learns to predict latents can hold uncertainty and sample multiple solution paths in that compressed space rather than committing token-by-token (see "Can stochastic latent reasoning help models explore multiple solutions?").
Sources 8 notes
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.