Why do single vectors fail at capturing negation and word order?

This explores why a single embedding vector — the kind that powers most semantic search — struggles to tell 'dog bit man' from 'man bit dog,' or 'is' from 'is not.'

The sharpest answer in the corpus is geometric, not just empirical: unit-sphere cosine spaces force concepts into linear superposition, which means you essentially *add* the meaning of each word together. But addition is commutative (a + b = b + a), while language is not. 'Dog bit man' and 'man bit dog' use identical ingredients in a different arrangement, and a commutative geometry cannot robustly keep them apart. Negation is the same problem in another key: 'is' and 'is not' share almost all their tokens, so they land close together on the sphere even though they mean opposites. The corpus frames this as a constraint that 'persists regardless of training procedure': you can't train your way out of a geometry that has no place to put the distinction (see: Why can't cosine space retrievers distinguish word order?).
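The commutativity argument can be made concrete with a toy sketch. It assumes a hypothetical six-word vocabulary with one-hot word vectors and plain sum pooling, which is a simplification and not how any production encoder works; the names `VOCAB`, `embed`, and `cosine` are illustrative only.

```python
import math

# Assumption: a tiny hypothetical vocabulary with one-hot word vectors.
WORDS = ["dog", "bit", "man", "is", "not", "safe"]
VOCAB = {
    w: [1.0 if i == j else 0.0 for j in range(len(WORDS))]
    for i, w in enumerate(WORDS)
}

def embed(sentence):
    """Sum the word vectors, then normalise the result onto the unit sphere."""
    total = [0.0] * len(WORDS)
    for word in sentence.split():
        total = [t + v for t, v in zip(total, VOCAB[word])]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total]

def cosine(a, b):
    """Cosine similarity; a plain dot product because both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Same words, different order: the sums are identical, so similarity is maximal.
order_sim = cosine(embed("dog bit man"), embed("man bit dog"))

# One extra token ("not") only nudges the sum; the opposite meaning stays close.
negation_sim = cosine(embed("dog is safe"), embed("dog is not safe"))

print(round(order_sim, 6), round(negation_sim, 3))
```

In this toy setup `order_sim` is 1.0 up to floating-point rounding, because both sentences sum the same three vectors, and `negation_sim` is about 0.87: adding 'not' shifts the vector slightly instead of flipping it. Real encoders are contextual and not literal sums, so the collapse is less total in practice, but the sketch shows where the pull toward order-blindness comes from.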

A second note sharpens *what* embeddings actually measure: not relevance or logical role, but co-occurrence and association. That's why a query and a semantically related but role-reversed candidate look nearly identical: the vector encodes 'these words hang out together,' not 'this one is the subject and that one is the object' (see: Do vector embeddings actually measure task relevance?). Negation is the extreme case: the negated thing co-occurs heavily with the thing it negates, so association pulls them together exactly when meaning pushes them apart.

The same asymmetry shows up one level up, in how models *learn* facts. The reversal curse (models trained on 'A is B' failing at 'B is A') reveals that representations are direction-bound rather than symmetrically relational (see: Why can't language models reverse learned facts?). It's a cousin of the word-order problem: order and direction carry meaning that a flattened representation discards. And when you test grammar directly, competence degrades predictably as sentences nest and embed, which is evidence that what's captured is surface heuristics, not the structural rules that make word order matter (see: Does LLM grammatical performance decline with structural complexity? and Why do large language models fail at complex linguistic tasks?).

What you might not expect: the corpus also shows a way out, and it's geometric too. A 'polar probe' finds that inside a model's activations, syntactic relations *are* encoded, using both distance and angular position to mark the type and direction of a relation (see: How do language models encode syntactic relations geometrically?). The information survives internally; it's the act of collapsing everything onto a single cosine-similarity sphere that throws away the angle. So the failure isn't that order and negation are unlearnable. It's that one vector is the wrong container. The fixes that follow are architectural: token-level interaction (let words compare directly instead of pre-summing) or a downstream verification step that re-checks order and polarity after retrieval (see: Why can't cosine space retrievers distinguish word order?).
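As a rough illustration of the second fix, a downstream verification step could look something like the sketch below. Everything here is an assumption made for the example: the tiny `NEGATION_CUES` set, the whitespace tokenisation, and the helper names `polarity`, `same_order`, and `verify` are hypothetical, and a real system would more likely use a parser or a cross-encoder than string matching.

```python
# Assumption: a deliberately tiny, hypothetical list of negation cues.
NEGATION_CUES = {"not", "no", "never"}

def polarity(tokens):
    """Return True if the sentence contains an odd number of negation cues."""
    return sum(t in NEGATION_CUES for t in tokens) % 2 == 1

def same_order(query_tokens, cand_tokens):
    """Check that the words shared by both texts appear in the same relative order."""
    shared = set(query_tokens) & set(cand_tokens)
    q_seq = [t for t in query_tokens if t in shared]
    c_seq = [t for t in cand_tokens if t in shared]
    return q_seq == c_seq

def verify(query, candidate):
    """Keep a retrieved candidate only if polarity and shared-word order both match."""
    q, c = query.lower().split(), candidate.lower().split()
    return polarity(q) == polarity(c) and same_order(q, c)

# Pretend these came back from a vector search for the query "dog bit man".
candidates = ["man bit dog", "dog bit man", "dog did not bite man"]
kept = [c for c in candidates if verify("dog bit man", c)]
print(kept)
```

With these inputs only 'dog bit man' survives: the role-reversed candidate fails the order check and the negated one fails the polarity check. The point of the design is that the cheap vector search still does the recall, while order and polarity are re-checked on the raw tokens, where that information has not yet been summed away.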

Sources (6 notes)

Why can't cosine space retrievers distinguish word order?

Unit-sphere cosine spaces force concepts into linear superposition, a commutative structure that cannot robustly represent non-commutative distinctions like "dog bit man" versus "man bit dog." This geometric constraint persists regardless of training procedure and requires architectural alternatives like token-level interaction or downstream verification.

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Why can't language models reverse learned facts?

Autoregressive training encodes directional associations rather than symmetric relations. Models trained on "A is B" cannot reliably retrieve answers for "B is A," revealing that knowledge representation is format-bound rather than abstractly relational.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about why single vectors fail at negation and word order. The question remains open: can a fixed-dimensional representation ever encode both semantic content AND structural information like polarity and syntax?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as snapshots, not current fact.
- Unit-sphere cosine spaces are *geometrically* hostile to non-commutative structure because addition is commutative; 'A bit B' and 'B bit A' end up with nearly the same vector and cannot be robustly separated (2023–2025).
- Embeddings measure co-occurrence and association, not logical role or task relevance; negation fails because the negated concept co-occurs heavily with its opposite (2023–2025).
- The reversal curse (models trained on 'A is B' fail at 'B is A') reveals representations are direction-bound, not symmetrically relational; grammatical competence degrades predictably with nesting (2023–2025).
- A polar coordinate system in LLM internal activations *does* encode both type and direction of syntactic relations; the failure is collapsing to a single cosine-similarity sphere, not unlearnable structure (2024–2025).
- Embedding-based retrieval faces theoretical limitations that may be fundamental to how fixed vectors compress meaning (2025).

Anchor papers (verify; mind their dates):
- arXiv:2309.12288 (2023) — The Reversal Curse
- arXiv:2412.05571 (2024) — Polar coordinate system in LLM activations
- arXiv:2503.19260 (2025) — Linguistic Blind Spots
- arXiv:2508.21038 (2025) — Theoretical Limitations of Embedding-Based Retrieval

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—especially the geometric impossibility claim and the co-occurrence-vs.-role distinction—judge whether newer models, training methods, architectural changes (multi-head attention, mixture-of-experts, structured latents), or hybrid retrieval (dense + sparse, dense + syntax-aware verification) have since relaxed or overturned it. Separate the durable question (likely: can *any* fixed vector capture both content and structure?) from the perishable limitation (e.g., 'current cosine spheres cannot'). Cite what resolved it.
(2) Surface the strongest *contradicting* work from the last ~6 months. Are there models or methods that DO capture negation and order in a single vector? What is their trick, and does it really solve the problem or sidestep it?
(3) Propose 2 research questions that *assume* the regime may have moved: (a) If structured latents or learned coordinate systems inside transformers can encode polarity and syntax, can we *extract* that structure and use it for better retrieval? (b) Do retrieval systems that augment fixed embeddings with a lightweight order-and-negation verification step now outperform retrieval-only baselines? By how much, and at what latency cost?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
