Why do leading embedding eigenvectors align with WordNet taxonomy structure?
This explores why the top eigenvectors of word-embedding matrices line up with the WordNet tree — and the corpus's answer is that it falls out of word co-occurrence statistics, not from any hierarchy-building machinery inside the model.
The short version from the corpus: the hierarchy was never designed in; it is a mathematical shadow of how words co-occur in text. When you take the Gram matrix of the embeddings and look at its top eigenvectors, they split the vocabulary coarse-to-fine: the broadest taxonomic branches separate first, then progressively finer sub-branches, tracking the WordNet hypernym tree level by level (Do embedding eigenvectors organize taxonomy from coarse to fine?). That ordering isn't a coincidence the model stumbled into; it is predicted directly from the spectral structure of co-occurrence statistics (Where does hierarchical structure in language models come from?).
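To make the coarse-to-fine claim concrete, here is a minimal sketch on a hypothetical eight-word toy taxonomy. It is not WordNet data and not the procedure used in the cited notes. The word list, the branch labels, and the feature scales (1.0, 0.7, 0.4) are all assumptions chosen for illustration. The sketch builds embeddings from nested indicator features, forms the centered Gram matrix, and inspects its leading eigenvectors.

```python
import numpy as np

# Hypothetical toy taxonomy (not WordNet data): 8 words, 2 broad branches, 4 sub-branches.
words = ["robin", "sparrow", "trout", "salmon", "oak", "pine", "rose", "tulip"]
top = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # animal vs. plant
mid = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # bird, fish, tree, flower

def one_hot(labels, k):
    """Return an (n, k) indicator matrix for integer labels."""
    return np.eye(k)[labels]

# Assumed embedding: each word is the sum of scaled indicator features for its
# broad branch, its sub-branch, and itself. The scales are arbitrary.
X = np.hstack([1.0 * one_hot(top, 2), 0.7 * one_hot(mid, 4), 0.4 * np.eye(8)])

# Center the embeddings, then form the Gram matrix of word-by-word inner products.
Xc = X - X.mean(axis=0)
gram = Xc @ Xc.T

# eigh returns ascending eigenvalues for symmetric matrices; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(gram)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The leading eigenvector separates the two broad branches by sign.
first = eigvecs[:, 0]
branch_split = {w: int(first[i] > 0) for i, w in enumerate(words)}
print(branch_split)

# The next two eigenvectors vary only between sub-branches, not within them.
for k in (1, 2):
    print(k, np.round(eigvecs[:, k], 3))
```

In this toy setup the first eigenvector carries the animal/plant split, and the following pair only distinguishes sub-branches inside each broad branch. Real embeddings are noisier, so the separation there is a tendency and not an exact block structure.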
The strongest evidence that this is statistics rather than design is a convergence argument. Word2vec embeddings and Gemma 2B's unembeddings come from models with completely different training objectives, yet they show the *same* coarse-to-fine spectral signature across WordNet taxonomies (Do language models use the hierarchical geometry they inherit?). If two systems built for different purposes inherit identical geometry, the geometry can't be a functional adaptation of either one. It has to come from the shared input: the text. Words that share a hypernym (robin, sparrow → bird) occur in overlapping contexts, and that overlap structure, when you do the linear algebra, factors into nested clusters that recover the tree.
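The context-overlap step can be illustrated with a small sketch. The six-sentence corpus below is invented for this example, and the helper names `context_counts` and `cosine` are hypothetical. It counts sentence-level co-occurrences for each target word and compares context vectors by cosine similarity, on the assumption that words sharing a hypernym are written about in similar terms.

```python
from collections import Counter
import math

# Invented toy corpus; real co-occurrence statistics come from far larger text.
corpus = [
    "the robin sings and flies over the garden",
    "the sparrow sings and flies over the field",
    "the trout swims in the cold river",
    "the salmon swims in the cold stream",
    "the oak grows tall in the forest",
    "the pine grows tall in the forest",
]

def context_counts(word):
    """Count every other token in each sentence that contains `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                counts.update(tokens[:i] + tokens[i + 1:])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vectors = {w: context_counts(w) for w in
           ["robin", "sparrow", "trout", "salmon", "oak", "pine"]}

same_branch = cosine(vectors["robin"], vectors["sparrow"])   # both birds
cross_branch = cosine(vectors["robin"], vectors["oak"])      # bird vs. tree
print(same_branch > cross_branch)
```

Words under the same hypernym end up with more similar context vectors than words from different branches, which is the overlap that a spectral decomposition then turns into nested clusters.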
What makes this genuinely surprising is the reframe it forces: hierarchy in language models needs no dedicated mechanism. The nested 'is-a' structure we associate with deliberate ontology emerges as a byproduct of counting which words appear near which other words (Where does hierarchical structure in language models come from?). The eigenvectors are just the principal directions of variation in that co-occurrence cloud, and the largest directions happen to be the broadest semantic distinctions.
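A worked toy case suggests why the largest directions are the broadest distinctions. Assume, purely for illustration, eight words in a balanced binary tree (two broad branches of four words, each split into two sub-branches of two words) and a Gram matrix that adds a fixed similarity bonus for every shared ancestor. The symbols a, b, c, A_1, and A_2 are introduced here and do not come from the cited notes.

```latex
% Assumed idealized Gram matrix for 8 words in a balanced binary tree.
% (A_1)_{ij} = 1 if words i and j share a broad branch (blocks of 4), else 0.
% (A_2)_{ij} = 1 if words i and j share a sub-branch (blocks of 2), else 0.
G = a\,A_1 + b\,A_2 + c\,I, \qquad a, b, c > 0

% After centering (which removes the constant direction), the eigenvalues are:
\lambda_{\text{broad}} = 4a + 2b + c \quad \text{(1 eigenvector: branch 1 vs. branch 2)}
\lambda_{\text{sub}}   = 2b + c      \quad \text{(2 eigenvectors: sub-branch splits)}
\lambda_{\text{word}}  = c           \quad \text{(4 eigenvectors: splits within a sub-branch)}

% Hence, for any positive a and b:
\lambda_{\text{broad}} > \lambda_{\text{sub}} > \lambda_{\text{word}}
```

In this idealized case the eigenvalue of a split grows with the number of words sitting beneath it, so the broadest branch always comes first in the spectrum regardless of the exact similarity values.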
This connects to a broader theme in the corpus about what embedding geometry actually encodes, and where it stops. Embeddings measure semantic *association*, the same co-occurrence signal that produces the taxonomy, which is exactly why they conflate concepts that are semantically close but play different roles (Do vector embeddings actually measure task relevance?). The same statistical substrate carries other structure too: static embeddings already hold psycholinguistic content like valence and concreteness before attention even runs (Do transformer static embeddings actually encode semantic meaning?), and models spontaneously develop structured geometry for syntax in polar coordinates, using both distance and angle (How do language models encode syntactic relations geometrically?), and for semantic features along human-like evaluation axes (Do LLM semantic features organize along human evaluation dimensions?).
The takeaway: the 'ontology' inside a language model isn't knowledge it learned about categories; it's the residue of statistics. And that has a practical edge worth noticing. If you want a model to *use* structured knowledge rather than just inherit its statistical silhouette, you may have to build the structure in explicitly, the way knowledge-graph curricula and taxonomy-organized training do (Can organizing knowledge structures beat raw training data volume?; Can knowledge graphs teach models deep domain expertise?). The spectrum gives you the shape of the tree for free; it doesn't give you the reasoning that the tree is supposed to support.
Sources (9 notes)
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.
Word2vec embeddings and Gemma 2B unembeddings share identical coarse-to-fine spectral signatures across WordNet taxonomies. Since these models have entirely different objectives, the shared structure must originate from training text statistics rather than convergent functional needs.
Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.
StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.