Do modern architectures in NLP and vision rely on dot products intentionally?
This asks whether the dot-product operation — attention's query-key scoring and the similarity math behind embeddings — is a deliberate design choice in transformers and vision models, or just an incidental implementation detail.
The honest answer first: this collection isn't built around the mechanics of attention math, so it has no note that directly argues the case for (or against) dot products as an intentional primitive. What it does have is a set of papers that circle the same territory from the outside: they show you what the dot product is *for*, and where leaning on it starts to break down.
The clearest doorway is work that moves computation into embedding space on purpose. Meta's Large Concept Model reasons over whole-sentence embeddings in a language-agnostic space before decoding to any target language (see "Can reasoning happen at the sentence level instead of tokens?"). That only works because meaning is encoded as geometry: vectors whose closeness, usually measured with a dot product or its length-normalized form, cosine similarity, stands in for semantic relatedness. So yes, the reliance is intentional: these systems are deliberately built so that comparing two vectors *is* the act of comparing two meanings.
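To make "closeness stands in for relatedness" concrete, here is a minimal sketch in plain Python. The three vectors are made-up 3-dimensional stand-ins for sentence embeddings (real embeddings have hundreds or thousands of dimensions and come from a trained encoder), and the variable names are hypothetical. It is not how the Large Concept Model computes anything, only an illustration of the comparison operator itself.

```python
import math

def dot(u, v):
    """Plain dot product: sum of elementwise products."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product of u and v after scaling each to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy "sentence embeddings" (assumed values, not model output).
cat_sat = [0.9, 0.1, 0.0]        # "The cat sat on the mat."
feline_rested = [0.8, 0.2, 0.1]  # "A feline rested on the rug."
stock_fell = [0.0, 0.2, 0.9]     # "The stock market fell."

sim_related = cosine_similarity(cat_sat, feline_rested)
sim_unrelated = cosine_similarity(cat_sat, stock_fell)

# Vectors pointing in similar directions score higher.
print(sim_related > sim_unrelated)
```

The only design choice here is normalization: a raw dot product also rewards long vectors, while cosine similarity compares direction alone. Which of the two a given system uses is a detail the notes in this collection do not cover.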
But the corpus is more interesting on the limits of that bet than on its design rationale. Several notes show that geometric similarity captures surface co-occurrence rather than structure. LLMs make systematic grammatical errors that worsen predictably with syntactic depth, a sign of statistical pattern-matching that never recovers deep rules (see "Why do large language models fail at complex linguistic tasks?"). And when reasoning tasks are stripped of familiar semantic content, performance collapses even with the correct rules supplied in context, because the models lean on token associations rather than formal manipulation (see "Do large language models reason symbolically or semantically?"). Both are symptoms of an architecture that scores similarity beautifully but doesn't natively do symbolic structure.
The architecture-shape papers add a different angle. MobileLLM finds that depth beats width at small scale: stacking layers to compose abstract concepts outperforms spreading parameters wide (see "Does depth matter more than width for tiny language models?"). Logit-lens work shows that models trained to emit filler tokens in place of visible reasoning compute the correct answer in their earliest layers and then suppress it in the final layers (see "Do transformers hide reasoning before producing filler tokens?"). These tell you the real expressive power comes from *composing* many similarity-based operations across depth, not from any single dot product doing the heavy lifting.
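The composition point can be sketched with standard scaled dot-product attention, written here in plain Python. This is a deliberately stripped-down toy under stated assumptions: one head, no learned query/key/value projections, no MLP or layer norm, and a hypothetical `stack` helper that reuses the same layer with a residual add. It is not the architecture of MobileLLM or of any model discussed in these notes.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query is dotted with every key,
    the scores are softmaxed, and the weights average the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def stack(tokens, num_layers):
    """Hypothetical helper: apply the same toy self-attention layer
    num_layers times, adding each layer's output back to its input."""
    x = tokens
    for _ in range(num_layers):
        mixed = attention(x, x, x)  # assumption: Q = K = V = x, no projections
        x = [[a + b for a, b in zip(row, m)] for row, m in zip(x, mixed)]
    return x

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token vectors
one_layer = stack(tokens, 1)
four_layers = stack(tokens, 4)
```

Each layer only ever compares vectors and mixes them, yet the representations after four layers differ from those after one, because later comparisons operate on the outputs of earlier ones. That is the sense in which depth, rather than any individual dot product, carries the expressive load.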
So the thing you might not have known you wanted to know: the dot product isn't where intelligence lives in these models — it's the cheap, repeatable comparison operator that gets stacked thousands of times so that depth and composition can do the actual work. The corpus is rich on what that stacking can and can't deliver, even though it never names the dot product as the protagonist. If you want material specifically on attention mechanics or vision-transformer design choices, that's a gap here worth flagging rather than papering over.
Sources (5 notes)
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.