How do latents at the same hierarchy level become more correlated than tokens?

This explores why representations learned at a given level of abstraction (latents) share more statistical structure with each other than the raw tokens they were built from — and what that buys a model that learns to predict its own latents instead of the next token.

Latents at the same hierarchy level end up more correlated than tokens, and that correlation is what makes latent prediction efficient. The cleanest answer comes from a sample-complexity argument. When data is generated by a compositional hierarchy, the tokens at the bottom are the noisy, high-variance leaves: many different surface tokens can express the same underlying chunk, so any two surface tokens are only weakly correlated. The latents one level up are the shared causes behind those tokens. Because they sit closer to the generative structure, sibling latents at the same level co-vary tightly. Predicting your own latents (the data2vec/JEPA family) therefore recovers the hierarchy with a number of samples that stays constant in depth, while token-level next-word prediction needs exponentially many, precisely because same-level latents are far more correlated than raw tokens (see "Why is predicting latents more sample-efficient than tokens?").
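The gap between latent and token correlation can be checked on a toy case. The sketch below is an illustrative assumption, not the model from the cited analysis: two sibling latents each copy a shared binary parent with flip probability eps, and each latent value is expressed by one of m interchangeable synonym tokens. The names latent_joint, token_joint, and indicator_corr are hypothetical. Everything is computed exactly from the joint distribution, so no sampling is involved.

```python
from itertools import product


def latent_joint(eps):
    """Exact joint P(a, b) for two sibling latents that each copy a shared
    uniform binary parent and flip independently with probability eps."""
    joint = {}
    for z, a, b in product((0, 1), repeat=3):
        p = 0.5
        p *= (1 - eps) if a == z else eps
        p *= (1 - eps) if b == z else eps
        joint[(a, b)] = joint.get((a, b), 0.0) + p
    return joint


def token_joint(joint, m):
    """Assumed emission rule: latent value v emits one of the m synonym
    tokens v*m + k (k = 0..m-1) uniformly at random."""
    out = {}
    for (a, b), p in joint.items():
        for i, j in product(range(m), repeat=2):
            out[(a * m + i, b * m + j)] = p / (m * m)
    return out


def indicator_corr(joint, x, y):
    """Pearson correlation between the indicators [first == x] and [second == y]."""
    pxy = joint.get((x, y), 0.0)
    px = sum(p for (u, _), p in joint.items() if u == x)
    py = sum(p for (_, v), p in joint.items() if v == y)
    return (pxy - px * py) / ((px * (1 - px) * py * (1 - py)) ** 0.5)


latents = latent_joint(eps=0.1)
tokens = token_joint(latents, m=4)
print("latent correlation:", indicator_corr(latents, 1, 1))
print("token correlation: ", indicator_corr(tokens, 4, 4))
```

With eps = 0.1 and m = 4, the latent indicators correlate at 0.64 while a specific pair of tokens correlates at roughly 0.09. Increasing m shrinks the token figure further and leaves the latent figure unchanged, which is the toy version of "many surface tokens express the same chunk".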

Where does that correlation structure come from in the first place? You don't need a hierarchy-building mechanism; it falls out of corpus statistics. Spectral analysis of word co-occurrence matrices reproduces the same nested geometry found in trained embeddings, meaning the hierarchy is latent in the data and the model merely reads it off (see "Where does hierarchical structure in language models come from?"). The leading eigenvectors of embedding Gram matrices split taxonomy coarse-to-fine, broad branches first and finer sub-branches after, mirroring the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). Each eigenvector level is a coherent band of correlated directions; that banding is exactly the same-level correlation the sample-efficiency result exploits.
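The coarse-to-fine ordering can be seen on a hand-built example. The sketch below assumes a toy Gram matrix over four words in two branches, where similarity is a sum of a root-level term, a same-branch term, and a per-word term. The matrix, the word list, and the helper names are illustrative assumptions rather than real corpus statistics, and plain power iteration with deflation stands in for a proper eigensolver.

```python
WORDS = ["cat", "dog", "car", "bus"]
BRANCH = [0, 0, 1, 1]  # assumed taxonomy: animal, animal, vehicle, vehicle


def toy_gram(shared_root=1.0, shared_branch=1.0, own=1.0):
    """Similarity = root term + same-branch term + per-word (diagonal) term."""
    n = len(WORDS)
    return [
        [
            shared_root
            + (shared_branch if BRANCH[i] == BRANCH[j] else 0.0)
            + (own if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def top_eigenpair(M, start, iters=200):
    """Power iteration: returns the largest eigenvalue and its unit eigenvector."""
    v = normalize(start)
    for _ in range(iters):
        v = normalize(matvec(M, v))
    lam = sum(a * b for a, b in zip(v, matvec(M, v)))  # Rayleigh quotient
    return lam, v


def deflate(M, lam, v):
    """Remove an already-found eigenpair so the next one becomes the largest."""
    n = len(v)
    return [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


G = toy_gram()
start = [1.0, 0.5, 0.25, 0.125]  # arbitrary start, not orthogonal to the top vectors
lam1, v1 = top_eigenpair(G, start)
lam2, v2 = top_eigenpair(deflate(G, lam1, v1), start)
print("eigenvalue 1:", round(lam1, 6), [round(x, 3) for x in v1])
print("eigenvalue 2:", round(lam2, 6), [round(x, 3) for x in v2])
```

On this toy matrix the top eigenvector is uniform (eigenvalue 7), and the second (eigenvalue 3) takes one sign on cat and dog and the opposite sign on car and bus. The within-branch splits sit lower in the spectrum (eigenvalue 1), so the coarse branch split appears before any fine one.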

The practical payoff shows up when models are built to reason in latent space rather than token space. Operating on sentence embeddings instead of tokens lets a model plan in a language-agnostic abstraction layer and produce more coherent output than flat token generation, because the units it manipulates already carry the shared structure (see "Can reasoning happen at the sentence level instead of tokens?"). Latent-thought models go further, scaling reasoning along a latent dimension that is independent of parameter count (see "Can latent thought vectors scale language models beyond parameters?"). That is possible only because latent units are information-dense and correlated enough that adding them along that axis pays off.

There's a nice mirror image on the token side. Even within token sequences, not all tokens are equal: a small minority of high-entropy "forking" tokens carries most of the learning signal in reasoning models, and training only on that ~20% matches or exceeds full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). Models also internally rank tokens by functional importance, preserving symbolic-computation tokens while pruning grammar and meta-discourse first (see "Which tokens in reasoning chains actually matter most?"). Read together, these say the same thing from the bottom up: most tokens are redundant low-information leaves, and the real structure lives at the pivot points and the level above. Latent prediction wins because it targets that structure directly instead of paying the token tax to rediscover it.
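Selecting the high-entropy minority can be sketched in a few lines. The version below assumes each position comes with a next-token probability distribution, ranks positions by Shannon entropy, and keeps the top fraction. The 0.2 default echoes the ~20% figure above; the top-k selection rule and the names entropy, forking_mask, and masked_mean are illustrative assumptions, not the procedure from the cited work.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(dists, keep_frac=0.2):
    """Mark the keep_frac highest-entropy positions as True."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(keep_frac * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(ents))]


def masked_mean(per_token_loss, mask):
    """Average the loss over kept positions only; other tokens contribute nothing."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept)


# Five positions: only the second distribution is close to a real fork.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]
mask = forking_mask(dists)
print(mask)
print(masked_mean([0.1, 2.0, 0.3, 0.05, 0.6], mask))
```

Here only the uniform distribution at the second position is kept, so the masked loss is that position's loss alone. In a real training loop the mask would gate the per-token objective; how that is wired up depends on the framework and is left out.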

The surprise worth carrying away: the correlation isn't a quirk of one architecture; it's a property of the data's generative geometry. That is also why a model that learns to predict latents can hold uncertainty and sample multiple solution paths in that compressed space rather than committing token-by-token (see "Can stochastic latent reasoning help models explore multiple solutions?").

Sources (8 notes)

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Where does hierarchical structure in language models come from?

LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about latent-space correlation and hierarchy in language models. The question remains: why do latents at the same hierarchy level correlate more strongly than tokens, and what does that correlation enable?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as claims to verify, not settled fact.
- Same-level latents are far more correlated than tokens because tokens are noisy leaves of a compositional hierarchy; predicting latents needs constant sample complexity in depth, while token prediction needs exponential samples (2026).
- Hierarchical geometry emerges directly from corpus statistics (word co-occurrence spectral structure); no dedicated mechanism builds it, and leading eigenvectors split taxonomy coarse-to-fine, mirroring WordNet levels (2026).
- Operating in latent space (sentence embeddings, latent-thought models) enables language-agnostic reasoning and adds orthogonal scaling dimensions beyond parameters (2024–2025).
- Within tokens, ~20% of high-entropy "forking" tokens drive learning in reasoning; models internally rank tokens by functional importance (2025–2026).
- Latent reasoning can remain stochastic, sampling multiple solution paths in compressed space rather than committing token-by-token (2026).

Anchor papers (verify; mind their dates):
- 2305.14825 (May 2023): In-context semantic reasoning vs. symbolic reasoning.
- 2412.06769 (Dec 2024): Reasoning in continuous latent space.
- 2605.27734 (May 2026): Sample-complexity theory of learning from latents.
- 2605.23821 (May 2026): Hierarchical concept geometry from co-occurrence.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—especially the sample-complexity exponential gap, the spectral emergence claim, and latent-space scaling gains—judge whether newer models (2026–present), training methods, or evaluation tools have since relaxed or overturned it. Separate the durable question (likely still: why does latent prediction outperform tokens?) from perishable limitations (possible: the exponential sample gap, the specific geometry of eigenvector splits). Cite what resolved it; flag what still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Has token-level prediction caught up via scaling or architectural changes? Do latent models actually suffer from information loss that limits reasoning depth?
(3) Propose 2 research questions that assume the regime may have moved—e.g., "Do frontier models still exhibit the 20% high-entropy token skew, or does scaling flatten it?" or "Can latent-space reasoning scale to multi-million-step planning without entropy collapse?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
