How do static embeddings and contextualized representations divide semantic labor?

This note explores the division of labor between static embeddings (the fixed vector a word carries before the model reads anything around it) and contextualized representations (the activations attention builds as it processes a sentence). The corpus suggests the split is real and surprisingly principled: static embeddings already carry a heavy load of meaning, and attention specializes in everything that depends on neighbors.
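To make the distinction concrete, here is a minimal sketch of the two kinds of representation. Everything in it is an assumption for illustration: the 2-d vectors in `STATIC` are made up, and `contextualize` is a single self-attention pass with identity query, key, and value maps, far simpler than any real transformer layer.

```python
import math

# Hypothetical 2-d static embeddings: one fixed vector per word type.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contextualize(tokens):
    """One attention pass: each token becomes a weighted mix of all tokens.

    Assumption: queries, keys, and values are the static vectors themselves.
    """
    vecs = [STATIC[t] for t in tokens]
    out = []
    for query in vecs:
        weights = softmax([dot(query, key) for key in vecs])
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vecs))
            for d in range(len(query))
        ]
        out.append(mixed)
    return out


static_bank = STATIC["bank"]                           # same in every sentence
bank_near_river = contextualize(["river", "bank"])[1]  # pulled toward "river"
bank_near_money = contextualize(["money", "bank"])[1]  # pulled toward "money"
```

The static lookup returns the same vector for "bank" in both sentences, while the attention output differs depending on which neighbor is present. That difference is the part of the labor the static entry cannot do on its own.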

The striking finding is how much semantic work happens before attention even fires. Clustering of RoBERTa's static embeddings shows sensitivity to valence, concreteness, iconicity, and taboo: psycholinguistic properties we'd assume require understanding, yet present in the raw lexical entry (see "Do transformer static embeddings actually encode semantic meaning?"). The structure goes deeper than isolated word features: the leading eigenvectors of the embedding Gram matrix split taxonomy coarse-to-fine, separating broad categories first and finer distinctions later, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). So static space isn't a bag of arbitrary points. It's a pre-organized semantic map, and it earns that organization purely from co-occurrence statistics, with no grounding in the world (see "Can language models learn meaning without engaging the world?").
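The coarse-to-fine spectral ordering can be illustrated on a toy example. The sketch below is not the cited analysis: the four words and their mean-centered vectors in `X` are hypothetical, built so that the category difference (animal vs. vehicle) is larger than the within-category differences. It uses plain power iteration with deflation to stay dependency-free; in practice a library eigensolver would be the usual choice.

```python
import math

WORDS = ["cat", "dog", "car", "bus"]

# Hypothetical, already mean-centered embeddings. Column 0 separates
# animals from vehicles; columns 1 and 2 separate words within a category.
X = [
    [1.0, 0.6, 0.0],    # cat
    [1.0, -0.6, 0.0],   # dog
    [-1.0, 0.0, 0.3],   # car
    [-1.0, 0.0, -0.3],  # bus
]


def gram(rows):
    """Gram matrix G[i][j] = <row_i, row_j>."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(m, k, iters=200):
    """Top-k eigenpairs of a small symmetric PSD matrix.

    Uses power iteration, then deflation to remove each found component.
    """
    m = [row[:] for row in m]  # work on a copy
    n = len(m)
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, generic start vector
        for _ in range(iters):
            w = matvec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


pairs = top_eigenpairs(gram(X), k=3)
for lam, vec in pairs:
    print(round(lam, 3), dict(zip(WORDS, (round(x, 3) for x in vec))))
```

In this toy setup the leading eigenvector assigns one sign to the animals and the opposite sign to the vehicles, and the next two eigenvectors split cat from dog and car from bus. Eigenvector signs are arbitrary, so only the pattern of agreement and disagreement is meaningful.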

What attention adds is relational and directional. The Polar Probe finds that contextualized activations encode syntactic *type and direction* through angle and distance between embeddings: information about how this word relates to that one, which can't exist until both words are in play (see "How do language models encode syntactic relations geometrically?"). This is the cleaner way to read the division: static space holds *what a word means on its own*; contextualization computes *what it means here, in relation to these other words*. One framing pushes this even further: knowledge in transformers isn't stored and retrieved so much as it flows through the residual stream as activations, generated fresh in each pass rather than looked up (see "Do transformer models store knowledge or generate it continuously?").
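A small geometric sketch shows why angle carries information that distance alone cannot. This is a toy under stated assumptions, not the Polar Probe itself: the 2-d positions of `saw` and `dog` are hypothetical stand-ins for token activations after some learned projection, and `polar_relation` is an illustrative helper name.

```python
import math


def polar_relation(source, target):
    """Return (distance, angle) of the vector pointing from source to target."""
    dx = target[0] - source[0]
    dy = target[1] - source[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


# Hypothetical projected positions of two tokens in one sentence.
saw = (0.0, 0.0)
dog = (1.0, 1.0)

r_fwd, theta_fwd = polar_relation(saw, dog)  # relation read as saw -> dog
r_bwd, theta_bwd = polar_relation(dog, saw)  # same pair, opposite direction

# Distance is symmetric, so it cannot say which token is the head.
same_distance = math.isclose(r_fwd, r_bwd)
# The angle flips by pi when the direction is reversed.
angle_gap = abs(theta_fwd - theta_bwd)
```

Both readings give the same distance, but the angles differ by pi, so a readout that uses angle as well as distance can tell a relation's direction apart. The same idea would let different relation types occupy different angular ranges, which is the kind of structure the probe is described as finding.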

But the handoff between the two is contested territory, and that's the part you might not expect. The static layer's strong priors can overpower the contextual layer: models fail to integrate what's in front of them when parametric associations from training dominate, and textual prompting alone doesn't override them; you have to intervene in the representations directly (see "Why do language models ignore information in their context?"). A related failure shows models leaning on raw statistical mass: they consistently prefer high-frequency surface phrasings over semantically identical rare ones, suggesting the baseline layer tracks frequency more than meaning (see "Do language models really understand meaning or just surface frequency?"). So the labor isn't always cleanly divided: sometimes the static priors refuse to yield the floor.

The most interesting move in the corpus is questioning whether the token is even the right unit for this split. Meta's Large Concept Model reasons over *sentence* embeddings in a language-agnostic space before decoding, suggesting the static-vs-contextual divide could be relocated to a higher level of abstraction entirely (see "Can reasoning happen at the sentence level instead of tokens?"). If you want to follow that thread, latent-thought models couple fast local variational learning with slow global decoder learning, a dual-rate scheme that looks a lot like formalizing the same division: a stable substrate plus a fast, context-sensitive layer on top (see "Can latent thought vectors scale language models beyond parameters?").

Sources (9 notes)

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-testing claims about static vs. contextualized semantic labor in transformers. The question remains open: *How do pre-token embeddings and attention-built activations actually divide semantic work?*

What a curated library found — spanning 2024–2026, so treat as dated claims:

• Static embeddings encode rich psycholinguistic properties (valence, concreteness, taboo) and organize hierarchically along WordNet taxonomy levels purely from co-occurrence, with no external grounding (2025-08).
• Contextualized representations add *relational and directional information*: a polar coordinate system encodes syntactic type and vector direction between tokens, information unavailable until both are in play (2024-12).
• Static priors can override context integration: models fail to update beliefs when training associations dominate; prompting alone does not override them, and causal intervention in the representations is required (2025-04).
• Models systematically prefer high-frequency surface phrasings over semantically identical rare ones, suggesting the static layer tracks frequency mass more than meaning (2026-04).
• The unit of static-vs-contextual division may not be the token: sentence-level reasoning in language-agnostic space and dual-rate latent-thought schemes relocate the split to higher abstraction (2025-02).

Anchor papers (verify; mind their dates):
• arXiv:2508.12863 — Word Meanings in Transformer Language Models (2025-08)
• arXiv:2412.05571 — A polar coordinate system represents syntax in large language models (2024-12)
• arXiv:2504.09522 — How new data permeates LLM knowledge and how to dilute it (2025-04)
• arXiv:2502.01567 — Scalable Language Models with Posterior Inference of Latent Thought Vectors (2025-02)

Your task:

(1) RE-TEST EACH CONSTRAINT. For static encodings of psycholinguistic properties and taxonomic hierarchy: have newer embeddings (e.g., from 2025–2026 models) been probed similarly? Does the hierarchy hold under distributional shift or multilingually? For the polar coordinate finding: does it generalize to all syntactic phenomena, or does it break under long-range or ambiguous contexts? For the frequency-wins claim: do retrieval-augmented or memory-augmented models show the same bias, or does external knowledge relax it? For context-integration failure: have in-context learning scaling or chain-of-thought methods since overcome the static override? Separate the durable question (how *any* system divides static and dynamic semantics) from the perishable limitation (models of 2024–2025 couldn't integrate context when priors were strong).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Have mechanistic studies of 2026 models found that attention and embedding space are *more* entangled than the clean division suggests? Do recursive or hierarchical concept geometry papers (arXiv:2605.23821, 2025-12) challenge the binary split?

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If latent-thought models formalize the static–contextual division at a sentence level, does that framework predict or explain failures in token-level context integration? (b) Can you design a probing suite that detects *whether* a model's architecture enforces a static–dynamic split, or does the split emerge regardless of design?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
