How do corpus statistics shape the abstraction hierarchy in language model representations?

This note explores where the layered, nested structure in language model representations actually comes from. The note collection's sharpest answer is that it falls out of the raw statistics of which words appear near which, with no special machinery required.

The question assumes that models build an abstraction hierarchy. The most direct answer in the notes is that they do not build it so much as inherit it from the mathematics of word co-occurrence. The standout finding is that hierarchical concept geometry needs no dedicated mechanism: it emerges as a mathematical consequence of corpus statistics, and a spectral analysis of which words appear together predicts and reproduces the same nested geometry found inside trained embeddings and even in older word2vec models (see "Where does hierarchical structure in language models come from?"). In other words, the shape of the data writes the shape of the representation.
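The spectral claim can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the eight-word taxonomy, the context words, the repeat counts, and the use of a plain SVD on raw counts (the cited work may normalize the co-occurrence matrix differently). It only shows how nested geometry can fall out of co-occurrence counts alone.

```python
import numpy as np

# Hypothetical toy taxonomy (not from the source):
# word -> (broad-category context, narrow-category context, word-specific context)
TOY_WORDS = {
    "dog": ("lives", "fur", "barks"),
    "cat": ("lives", "fur", "meows"),
    "eagle": ("lives", "wings", "soars"),
    "sparrow": ("lives", "wings", "chirps"),
    "oak": ("grows", "trunk", "acorns"),
    "pine": ("grows", "trunk", "cones"),
    "rose": ("grows", "petals", "thorns"),
    "tulip": ("grows", "petals", "bulbs"),
}
# Assumed frequencies: broad contexts co-occur more often than narrow ones.
REPEATS = (3, 2, 1)


def build_corpus():
    """Return a list of (word, context) co-occurrence events."""
    corpus = []
    for word, contexts in TOY_WORDS.items():
        for ctx, n in zip(contexts, REPEATS):
            corpus.extend([(word, ctx)] * n)
    return corpus


def cooccurrence_matrix(corpus):
    """Count how often each word appears with each context word."""
    words = list(TOY_WORDS)
    contexts = sorted({ctx for _, ctx in corpus})
    counts = np.zeros((len(words), len(contexts)))
    for word, ctx in corpus:
        counts[words.index(word), contexts.index(ctx)] += 1
    return words, counts


def spectral_embedding(counts, k):
    """Embed words with the top-k singular directions of the count matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k], s


words, counts = cooccurrence_matrix(build_corpus())
embedding, singular_values = spectral_embedding(counts, k=4)


def dist(a, b):
    """Euclidean distance between two words in the spectral embedding."""
    return float(np.linalg.norm(embedding[words.index(a)] - embedding[words.index(b)]))


print(np.round(singular_values, 3))
print(dist("dog", "cat"), dist("dog", "eagle"), dist("dog", "oak"))
```

In this toy setup the singular values fall into three tiers (two large, two intermediate, four small), one tier per level of the taxonomy, and the distance from "dog" grows from "cat" to "eagle" to "oak". No part of the code knows about the hierarchy beyond the counts themselves.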

If statistics sculpt the hierarchy, then statistical imbalance warps it, and several notes show exactly that. Models build shallower, weaker representations of whatever is under-represented in training. Historical legal cases get systematically worse treatment than modern ones because recent cases dominate the training corpus (see "Why do language models struggle with historical legal cases?"), and low-resource cultures get routed through high-resource cultural proxies as a structural pathway inside the model, not just a surface slip (see "Do LLMs represent low-resource cultures through dominant cultural proxies?"). The abstraction hierarchy isn't a neutral ladder; its rungs are spaced by how often the data talks about something.

The same statistical origin explains a ceiling on what the hierarchy can hold. Because representations track co-occurrence rather than rules, models capture surface patterns but miss deep grammatical structure: they reliably stumble on embedded clauses and complex nominals, and the errors get predictably worse as syntactic depth increases (see "Why do large language models fail at complex linguistic tasks?"). Push further and reasoning itself turns out to ride on semantic association rather than symbolic structure: strip the familiar meaning out of a task and performance collapses even when the rules are sitting right there in context (see "Do large language models reason symbolically or semantically?"). A hierarchy assembled from word statistics is excellent at the statistically frequent and brittle at the structurally deep.

There's a useful tension here worth chasing. One line of work suggests the hierarchy can be climbed deliberately rather than just absorbed: deep-and-thin architectures beat wide ones at small scale precisely by composing abstract concepts across layers (see "Does depth matter more than width for tiny language models?"), and chain-of-thought reasoning lets a model construct genuine syntactic trees and phonological generalizations it can't produce in a single pass (see "Can language models actually analyze language structure?"). So the static geometry handed to you by corpus statistics is one thing; what extra depth or explicit reasoning steps can build on top of it is another.

The quiet payoff: the abstraction hierarchy you can interrogate inside a model is largely a fossil of its training distribution. That reframes a lot of failures as shadows cast by the statistics that built the representations in the first place, rather than as bugs in the reasoning. Two examples are context being overridden by strong priors (see "Why do language models ignore information in their context?") and low-probability tasks like reversing the alphabet being hard for reasons that have nothing to do with logical difficulty (see "Can we predict where language models will fail?").
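The alphabet example can be made concrete with a small sketch. It is only an analogy under stated assumptions: a character bigram model with add-one smoothing stands in for a language model, and the training text is a hypothetical one in which the forward alphabet is frequent and the reversed alphabet never appears. It is not the experiment from the cited note.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Hypothetical training text: the forward alphabet is common, the reverse is absent.
TRAINING_TEXT = " ".join(
    [ALPHABET] * 20 + ["the quick brown fox jumps over the lazy dog"] * 5
)


def train_bigram(text):
    """Fit a character bigram model and return its log-probability function."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    vocab_size = len(set(text))

    def log_prob(prev, nxt):
        # Add-one smoothing so unseen pairs get a small nonzero probability.
        return math.log(
            (pair_counts[(prev, nxt)] + 1) / (first_counts[prev] + vocab_size)
        )

    return log_prob


def sequence_log_prob(log_prob, seq):
    """Sum of log P(next char | previous char) over the sequence."""
    return sum(log_prob(p, n) for p, n in zip(seq, seq[1:]))


log_prob = train_bigram(TRAINING_TEXT)
forward = sequence_log_prob(log_prob, ALPHABET)
backward = sequence_log_prob(log_prob, ALPHABET[::-1])

print(f"forward alphabet log-prob:  {forward:.2f}")
print(f"backward alphabet log-prob: {backward:.2f}")
```

Both strings contain the same letters and are equally simple to describe, yet the model assigns the reversed one a much lower probability, purely because of what the training text contained.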

Sources (9 notes)

Where does hierarchical structure in language models come from?

LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than on modern ones. The root cause is over-representation of recent cases in the training corpus, which creates shallower representations of older precedent.

Do LLMs represent low-resource cultures through dominant cultural proxies?

Mechanistic interpretability analysis reveals that low-resource cultures like Ethiopia and Algeria are structurally represented through high-resource cultural proxies in internal model states, not just output. This architectural bias persists even when models can produce correct surface-level answers.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
