How do embedding dimension limits constrain what concept models can represent?
This explores a hard ceiling question: whether the fixed size of an embedding vector (its dimension count) sets a mathematical limit on what a model built on those embeddings can actually represent — and what gets lost when it does.
The corpus has a surprisingly sharp answer at one end and a more textured one at the other. The cleanest result comes from communication-complexity theory: for any embedding dimension *d*, there's a maximum number of top-*k* document combinations that can ever be returned, and the limit holds even when the embeddings are optimized directly on the test data (see "Do embedding dimensions fundamentally limit retrievable document combinations?"). In other words, this isn't a training problem you can fix with more data; it's geometry. A vector of fixed width cannot encode arbitrarily many distinct relationships at once, and the failure shows up on tasks that look trivially easy.
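The geometric ceiling is easiest to see in the most extreme case, *d* = 1. The sketch below is a toy illustration only, not the construction used in the cited result: the six document values, the query sweep, and every variable name are assumptions made up for this example. With one dimension, the score is just the query scalar times the document scalar, so the ranking is either the documents' natural order or its reverse.

```python
from itertools import combinations

import numpy as np

# Hypothetical 1-D "embeddings" for six documents (d = 1).
docs = np.array([-2.0, -0.5, 0.3, 1.1, 1.7, 2.4])
k = 2

# Sweep many 1-D query embeddings; 600 evenly spaced points never land on 0,
# so there are no tied scores.
queries = np.linspace(-3.0, 3.0, 600)

achieved = set()
for q in queries:
    scores = q * docs                    # dot product in one dimension
    top_k = np.argsort(scores)[-k:]      # indices of the k highest scores
    achieved.add(tuple(sorted(int(i) for i in top_k)))

# Number of top-k sets a relevance judgment could in principle ask for.
possible = len(list(combinations(range(len(docs)), k)))

print(f"top-{k} sets reachable: {len(achieved)} of {possible}")
print(sorted(achieved))
```

Only two of the fifteen possible pairs are ever returned, no matter how the query is chosen. Higher dimensions raise the count, but the same kind of argument is why it stays bounded for any fixed *d*.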
But 'limit' cuts two ways, and the more interesting story is what models *choose* to spend their dimensions on. When LLMs are compared against humans under a rate-distortion analysis, they compress aggressively: they nail broad category structure but throw away the fine-grained, context-sensitive distinctions humans preserve (see "Do LLMs compress concepts more aggressively than humans do?"). So the constraint isn't only 'how many things fit' but 'what resolution survives.' That trade-off is visible in how embedding space is organized internally: the leading eigenvectors of embedding Gram matrices split concepts coarse-to-fine, separating broad taxonomic branches first and only progressively resolving finer ones (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). Dimensions get allocated top-down, which is exactly why detail is the first casualty when budget runs short.
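The coarse-to-fine ordering can be reproduced on a hand-built example. The sketch below assumes a hypothetical 4-by-4 Gram matrix for a two-level taxonomy (two branches of two concepts each); the similarity values 3, 2, and 1 are invented for illustration and are not taken from any real embedding.

```python
import numpy as np

# Hypothetical Gram matrix for four concepts in a two-level taxonomy:
# branch A = {0, 1}, branch B = {2, 3}.
# Entry is 3 for a concept with itself, 2 within a branch, 1 across branches.
gram = np.array([
    [3.0, 2.0, 1.0, 1.0],
    [2.0, 3.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 2.0],
    [1.0, 1.0, 2.0, 3.0],
])

# eigh returns eigenvalues in ascending order; reverse so the leading one is first.
eigenvalues, eigenvectors = np.linalg.eigh(gram)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

mean_direction = eigenvectors[:, 0]   # every entry has the same sign
coarse_split = eigenvectors[:, 1]     # sign separates branch A from branch B

print(np.round(eigenvalues, 6))
print(np.sign(coarse_split))
```

The eigenvalues come out as 7, 3, 1, 1. After the shared mean direction, the next-largest eigenvalue belongs to the vector that separates the two branches, and the within-branch distinctions sit in the smallest eigenvalues. Truncating to the top two directions in this toy case keeps the branch split and discards exactly the fine detail.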
There's a hopeful counter-current, though: a fixed number of dimensions can hold far more than a naive count suggests, because models exploit *structured* geometry rather than spending one dimension per fact. The Polar Probe shows syntax encoded in polar coordinates, using both distance *and* angle between embeddings, which nearly doubles accuracy over distance-only readings (see "How do language models encode syntactic relations geometrically?"). And even static, pre-attention embeddings already carry rich semantic content like valence, concreteness, and iconicity (see "Do transformer static embeddings actually encode semantic meaning?"). The same space is being read along multiple axes at once, which stretches what a given dimensionality can mean.
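Why angle adds information that distance alone misses can be shown with two points. This is not the Polar Probe itself, which learns a transformation of real model embeddings; it is a minimal sketch with hypothetical 2-D coordinates and a hypothetical helper named `polar`, meant only to show that two offsets can be indistinguishable by length and still differ by direction.

```python
import numpy as np

# Hypothetical 2-D positions: one head word and two dependents.
head = np.array([0.0, 0.0])
dependent_a = np.array([1.0, 0.0])   # imagine this arc is one relation type
dependent_b = np.array([0.0, 1.0])   # imagine this arc is a different type


def polar(offset):
    """Return (distance, angle in degrees) of a 2-D offset vector."""
    distance = float(np.linalg.norm(offset))
    angle = float(np.degrees(np.arctan2(offset[1], offset[0])))
    return distance, angle


dist_a, angle_a = polar(dependent_a - head)
dist_b, angle_b = polar(dependent_b - head)

print(dist_a, dist_b)     # the two distances are identical
print(angle_a, angle_b)   # the two angles differ
```

A distance-only reading sees the same value for both dependents and cannot tell the relations apart. Reading the angle as well recovers the distinction without using any extra dimensions.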
This reframes the whole concept-model design question. Meta's Large Concept Model bets that reasoning at the sentence-embedding level, in a language-agnostic space, produces more coherent output than flat token generation (see "Can reasoning happen at the sentence level instead of tokens?"). But if a sentence's full meaning has to survive compression into one fixed vector, the retrieval limit and the compression bias both bear directly on whether that vector can carry the nuance the task needs. Other work routes around the bottleneck rather than fighting it: latent-thought models add scaling dimensions *independent* of parameters (see "Can latent thought vectors scale language models beyond parameters?"), and small models do better going deep-and-thin, composing abstract concepts across layers, than spreading capacity across width (see "Does depth matter more than width for tiny language models?"). The shared lesson: you escape a dimensional ceiling not by widening the vector but by adding structure, whether through layers, polar geometry, or sequential composition.
The thing worth walking away with is that representational capacity and representational *integrity* are different limits. A model can have enough dimensions for every task-relevant feature to be linearly decodable while its internal organization is fractured and fragile under distribution shift (see "Can models be smart without organized internal structure?"). So 'can the embedding represent it?' and 'does the embedding represent it in a way that holds up?' are separate questions, and the dimension count constrains the first while saying little about the second.
Sources (9 notes)
Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.
Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
MobileLLM, built on a deep-and-thin architecture that composes abstract concepts through layers rather than spreading parameters across width, reports accuracy gains of 2.7% and 4.3% over prior state-of-the-art models at the 125M and 350M scales respectively.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.