Why are polysemantic features concentrated in early neural network layers?

This explores why the 'messy' features that fire for many unrelated concepts at once (polysemantic ones) tend to pile up near a network's input layers rather than its deeper ones. The corpus doesn't tackle polysemanticity head-on, but several notes circle the same territory: how features get cleaner as you go deeper.

No note here studies polysemanticity or superposition by name, but a few converge on a plausible explanation: early layers sit closest to raw tokens, where many surface forms must be crammed into limited dimensions, and the disentangling happens later. The sharpest evidence is circuit tracing in Claude models, which finds a four-tier progression: token-level inputs → abstract concepts → functional operations → outputs (How do language models organize features across processing layers?). The bottom tier is exactly where you'd expect crowding: a single early unit has to help represent every word that shares a spelling, a subword, or a context, so it ends up firing for a grab-bag of meanings. Abstraction, and with it the room to give concepts their own clean directions, only arrives deeper in.
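The crowding argument can be made concrete with a toy calculation. The sketch below is an illustrative assumption, not something taken from the cited notes: it treats each feature as a random unit direction and measures how much distinct features overlap as the number of available dimensions shrinks. The function names and the choice of random directions are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction in `dim` dimensions and normalise it."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_interference(n_features, dim, seed=0):
    """Average |cosine| between every pair of distinct feature directions.

    Toy assumption: features are random unit vectors. A value near 0 means
    features barely overlap; larger values mean they share directions.
    """
    rng = random.Random(seed)
    feats = [random_unit_vector(dim, rng) for _ in range(n_features)]
    total, pairs = 0.0, 0
    for i in range(n_features):
        for j in range(i + 1, n_features):
            total += abs(sum(a * b for a, b in zip(feats[i], feats[j])))
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Same 64 features, squeezed into progressively fewer dimensions.
    for dim in (256, 64, 8):
        print(dim, round(mean_interference(64, dim), 3))
```

Under these assumptions, the average overlap between distinct features rises as the same 64 features are packed into fewer dimensions, which is the sense in which a narrow representation forces units to be shared.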

Why is the early layer forced to be lossy and entangled? One note argues that the geometry of language models isn't hand-built but falls directly out of word co-occurrence statistics (Where does hierarchical structure in language models come from?). Words that appear in overlapping contexts start life tangled together; nothing has yet pulled them apart. Early representations inherit that raw statistical mush, and it's the job of later layers to carve the nested, separable structure out of it. So polysemanticity early isn't a bug so much as the unprocessed input distribution showing through before the network has done its work.
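A small sketch of what "tangled by co-occurrence" means in practice. This is a deliberately simpler stand-in for the spectral analysis the note describes: it only compares raw co-occurrence rows by cosine similarity, and the toy corpus, window size, and function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; "cat" and "dog" appear in identical contexts.
CORPUS = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate the food",
    "the dog ate the food",
    "investors sold the stock",
    "investors bought the stock",
]


def cooccurrence_rows(sentences, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    rows = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    rows[word][tokens[j]] += 1
    return rows


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


if __name__ == "__main__":
    rows = cooccurrence_rows(CORPUS)
    print("cat vs dog:", cosine(rows["cat"], rows["dog"]))
    print("cat vs investors:", cosine(rows["cat"], rows["investors"]))
```

In this toy corpus, "cat" and "dog" have identical neighbour counts, so raw statistics alone cannot tell them apart, while "cat" and "investors" share no neighbours at all. Separating the first pair is work that something downstream has to do.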

The 'depth does the disentangling' idea gets independent support from architecture experiments: deep-and-thin small models beat wider, shallower ones because stacking layers lets the network compose abstract concepts step by step rather than packing everything into a single wide bottleneck (Does depth matter more than width for tiny language models?). If composition is what depth buys you, then the early, pre-composition layers are the ones most likely to be doing broad, overloaded, many-meanings-per-unit encoding. Relatedly, compositional generalization tends to track how *linearly decodable* a concept's constituents are from the hidden activations (Can neural networks learn compositional skills without symbolic mechanisms?). Clean linear decodability is plausibly a deep-layer property, the opposite of the overlapping mixtures you find at the input.
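To make "linearly decodable" concrete, here is a minimal sketch of a linear probe. A perceptron stands in for the probe (the cited note does not specify a probing method), and the two four-point datasets are hypothetical: in one the concept has its own unit, in the other it exists only as an XOR-style mixture of two units. The toy illustrates what a probe measures; it does not model any real layer.

```python
def train_linear_probe(points, labels, epochs=50):
    """Fit a perceptron (weights plus bias) to binary labels."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else 0
            err = y - pred
            if err:
                # Standard perceptron update on a misclassified point.
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
    return w, b


def probe_accuracy(points, labels, epochs=50):
    """Train the probe, then report its accuracy on the same points."""
    w, b = train_linear_probe(points, labels, epochs)
    correct = 0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((1 if score > 0 else 0) == y)
    return correct / len(points)


# Two-unit toy "activations" (hypothetical).
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
DEDICATED = [0, 0, 1, 1]  # concept is read directly off unit 0
ENTANGLED = [0, 1, 1, 0]  # concept is the XOR of the two units

if __name__ == "__main__":
    print("dedicated:", probe_accuracy(POINTS, DEDICATED))
    print("entangled:", probe_accuracy(POINTS, ENTANGLED))
```

The dedicated case is linearly separable, so the probe can fit it exactly. No linear readout can classify more than three of the four XOR points correctly, so the entangled case stays below perfect accuracy however long the probe trains.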

Two more notes hint at the flip side: what 'cleaned up' looks like. Networks naturally sort compositional work into isolated, modular subnetworks, and pretraining makes that modularity more reliable (Do neural networks naturally learn modular compositional structure?). Hidden states also actively *sparsify* (fewer units firing, more selectively) when a task gets hard or unfamiliar (Do language models sparsify their activations under difficult tasks?). Both describe representations becoming dedicated and selective, which is precisely the regime polysemantic early features are not in. The unexpected payoff here: polysemanticity may be less an intrinsic property of 'early layers' and more a symptom of proximity to raw, uncompressed input statistics. On that reading, depth, modularity, and sparsification are all facets of the same process of pulling those tangled meanings apart.

Sources (6 notes)

How do language models organize features across processing layers?

Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.

Where does hierarchical structure in language models come from?

LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
