How should meaning spaces be systematically modeled across different applications?
This explores how 'meaning' actually gets encoded inside language models — the geometry, dimensions, and units of their internal representation spaces — and whether the corpus points toward any shared, systematic way to model meaning rather than ad hoc per-task tricks.
This reads the question as: is there a principled structure to how meaning lives inside these models, and can we model it the same way across tasks? The corpus suggests the evidence is converging on an answer: meaning spaces turn out to be surprisingly *structured*, but that structure also imposes hard limits.
The most striking thread is geometric. Several notes find that meaning isn't smeared randomly across high dimensions but organizes along a few stable axes. One shows that twenty-eight semantic features in LLM embeddings collapse into just three principal components that mirror the evaluation–potency–activity (EPA) structure psychologists have measured in humans for decades (Do LLM semantic features organize along human evaluation dimensions?). Another finds that models encode syntactic relations in something like *polar coordinates*, with distance for type and angle for direction, so the geometry itself carries symbolic meaning (How do language models encode syntactic relations geometrically?). Even static, pre-attention embeddings already cluster by valence, concreteness, and iconicity, which means real semantic content is loaded before the model does any contextual work (Do transformer static embeddings actually encode semantic meaning?). The recurring lesson: meaning spaces are low-dimensional, reusable, and quasi-symbolic, which is exactly what makes 'systematic modeling' plausible.
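To make the "twenty-eight features collapse into three components" claim concrete, here is a minimal sketch of how such a collapse could be measured. Everything in it is an assumption for illustration: the data is synthetic (28 feature directions built as noisy mixtures of 3 hidden axes in a 64-dimensional space), the names `explained_variance_ratio`, `latent_axes`, and `mixing` are hypothetical, and plain SVD-based PCA stands in for whatever analysis the cited note actually used.

```python
import numpy as np


def explained_variance_ratio(feature_directions: np.ndarray) -> np.ndarray:
    """Return the fraction of total variance captured by each principal component."""
    # Center each column so the SVD corresponds to PCA.
    centered = feature_directions - feature_directions.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()


# Synthetic stand-in for real embeddings (an assumption, not data from the note).
rng = np.random.default_rng(0)
dim, n_features, n_latent = 64, 28, 3

latent_axes = rng.normal(size=(n_latent, dim))      # three hidden "meaning axes"
mixing = rng.normal(size=(n_features, n_latent))    # how each feature loads on them
noise = 0.05 * rng.normal(size=(n_features, dim))   # small feature-specific residue
feature_directions = mixing @ latent_axes + noise   # shape (28, 64)

ratios = explained_variance_ratio(feature_directions)
top3_share = float(ratios[:3].sum())
print(f"variance captured by the first 3 components: {top3_share:.3f}")
```

On real embeddings the same function could be applied to a matrix of probe or difference vectors, one row per semantic feature. A top-three share close to 1 would be consistent with the low-dimensional picture, while a flat spectrum would argue against it.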
The question of the right *unit* runs alongside this. Meta's Large Concept Model argues the unit shouldn't be the token at all: it reasons over whole-sentence embeddings in a language-agnostic space, then decodes to any language, treating meaning as something that exists above words (Can reasoning happen at the sentence level instead of tokens?). A complementary note finds that not all tokens carry equal semantic weight: models internally rank them, preserving symbolic-computation tokens while discarding grammar and filler (Which tokens in reasoning chains actually matter most?). So a meaning space can be modeled coarsely (sentences/concepts) or finely (functionally-weighted tokens), and the application decides which granularity matters.
But the corpus also names the catch. The low-dimensional structure that makes meaning modelable also *entangles* it: intervene on one semantic feature and aligned features shift proportionally, producing unavoidable off-target effects (Do LLM semantic features organize along human evaluation dimensions?). There is also a deeper worry that what looks like a meaning space may partly be a *frequency* space: models systematically prefer high-frequency paraphrases over semantically equivalent rare ones, suggesting they track statistical mass rather than meaning per se (Do language models really understand meaning or just surface frequency?). That fragility shows up at the edges: models fail to hold multiple interpretations of ambiguous text at once (Can language models recognize when text is deliberately ambiguous?), and their performance degrades predictably as syntactic structure deepens (Why do large language models fail at complex linguistic tasks?).
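The entanglement point can be illustrated with a small sketch. It assumes, purely for illustration, that each feature is read out as a linear projection onto a unit direction and that an intervention adds a multiple of the target direction to a hidden state. Under that assumption the off-target shift on any other feature is the intervention strength times the cosine between the two directions. The vectors and the names `steer`, `target`, `aligned`, and `orthogonal` are hypothetical, not taken from the cited note.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(hidden, direction, alpha):
    """Intervene by adding alpha * direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical feature directions in a toy 3-dimensional space.
target = unit([1.0, 0.0, 0.0])       # the feature we intend to change
aligned = unit([0.8, 0.6, 0.0])      # cosine 0.8 with the target
orthogonal = unit([0.0, 0.0, 1.0])   # cosine 0.0 with the target

hidden = [0.2, -0.1, 0.5]
alpha = 2.0
steered = steer(hidden, target, alpha)

# Off-target shift = change in each feature's linear readout.
shift_aligned = dot(steered, aligned) - dot(hidden, aligned)
shift_orthogonal = dot(steered, orthogonal) - dot(hidden, orthogonal)
cosine = dot(target, aligned)

print(f"aligned feature shift:    {shift_aligned:.3f} (alpha * cosine = {alpha * cosine:.3f})")
print(f"orthogonal feature shift: {shift_orthogonal:.3f}")
```

In this toy setting the only features left untouched are those orthogonal to the steered direction. If many features share a few axes, as the low-dimensional findings suggest, few such orthogonal features would exist, which is one way to read "unavoidable" off-target effects.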
So the honest synthesis: across applications, meaning spaces should be modeled as low-dimensional, geometrically structured, and granularity-flexible, but the same notes warn that you can't treat one feature as independent of the rest, and that you should check whether you're modeling meaning or just frequency. The fact that o1 can produce genuine metalinguistic analyses, building syntax trees through explicit reasoning, hints that these spaces are rich enough to be inspected and described, not just used (Can language models actually analyze language structure?).
Sources (9 notes)
Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
The AmbiEnt benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.