How do MIPS algorithms constrain the choice of similarity functions?

This explores why production-scale retrieval systems can't pick any similarity function they want — the MIPS (Maximum Inner Product Search) algorithms that make billion-item retrieval fast only work for one geometric shape, and that constraint quietly dictates how models score matches.

This explores why the choice of similarity function in large retrieval systems isn't free: MIPS — Maximum Inner Product Search — is the family of algorithms that lets you find the best match among billions of items in milliseconds, and it only works when 'similarity' is a dot product. The constraint runs backwards from infrastructure to model design. You'd think the smart move is to learn the richest possible scoring function, but the corpus shows the opposite pressure: a function you can't index at scale is a function you can't use in production, no matter how expressive it is.

The cleanest evidence sits in the collaborative-filtering work where Rendle and colleagues pit learned MLP similarities against a plainly-tuned dot product Why does dot product beat MLP-based similarity in practice?. An MLP is a universal function approximator — in theory it can represent any similarity, including the dot product itself. In practice it loses, and badly: it needs far more capacity and data just to match simple geometric similarity Can MLPs learn to match dot product similarity in practice?. But the deeper point for MIPS is the second failure: even if an MLP *did* learn a great similarity, you couldn't retrieve against it efficiently. There's no indexing trick that finds the top match of an arbitrary neural network over a billion candidates. The dot product wins twice — once on accuracy from inductive bias, and once because MIPS algorithms can only accelerate inner-product geometry. That's the constraint in one sentence: pick a non-dot-product similarity and you give up sub-linear retrieval.

What's interesting is that the field keeps trying to *escape* this constraint by splitting scoring into two stages — use cheap dot-product MIPS to recall candidates, then apply an expensive, expressive function only to the survivors. You can see this exact move in retrieval verification: a pooled-cosine (dot-product-style) recall pass pulls candidates fast, then a small Transformer verifier operating on full token-token interaction patterns rejects the structural near-misses that the compressed vectors waved through Can verification separate structural near-misses from topical matches?. The expensive function never has to run at billion-scale because MIPS already narrowed the field. The constraint isn't defeated — it's quarantined to the first stage, which is the only stage that has to be MIPS-compatible.

The same architectural logic — keep the part that must scale geometrically simple, push richness elsewhere — shows up in how memory and computation get split. Engram pairs O(1) lookup with learned routing rather than asking one mechanism to do both, finding that balanced allocation beats either alone Can lookup memory and computation work together better than either alone?. The recurring lesson across these notes is that retrievability is a first-class design constraint, not an afterthought: the geometry that makes search tractable (inner products, indexable vectors) is decided *before* you decide how clever your scoring can be.

If you want the punchline a curious reader might not expect: MIPS doesn't just prefer the dot product, it effectively forbids alternatives at scale, and that's why a 'worse' but indexable similarity beats a 'better' but un-indexable one. Expressiveness you can't retrieve against is expressiveness you can't deploy — so the algorithm that does the searching ends up dictating the math the model is allowed to learn.

Sources 4 notes

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and they cannot be efficiently retrieved at scale—proving inductive bias matters more than expressiveness.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing constraints on similarity-function choice in retrieval systems. The core question remains: does MIPS (Maximum Inner Product Search) force dot-product geometry, or have newer methods, models, or orchestration patterns relaxed this coupling?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026, anchored in retrieval and LLM scaling work:
- MLP-based similarity functions underperform tuned dot products on accuracy and convergence, despite universal approximation; no efficient indexing exists for arbitrary neural similarities (~2020).
- Two-stage retrieval (cheap MIPS recall + expensive verification) quarantines the constraint to the first stage, allowing expressive scoring downstream (~2024–2026).
- Conditional memory and learned routing decouple lookup cost from scoring richness; balanced allocation (O(1) lookup + learned routing) outperforms joint optimization (~2026).
- Sparse autoencoders and diffusion-based retrieval introduce new basis representations; compositionality and out-of-distribution sensitivity remain open (~2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2005.09683 (Neural Collaborative Filtering, 2020)
- arXiv:2601.07372 (Conditional Memory via Scalable Lookup, 2026)
- arXiv:2604.16351 (Training for Compositional Sensitivity, 2026)
- arXiv:2405.08366 (Sparse Autoencoders for Interpretability, 2024)

Your task:
(1) RE-TEST: For each constraint above, ask whether recent advances in (a) multi-token attention or hierarchical decoding, (b) learned routing and sparse MoE architectures, or (c) retrieval-augmented generation with adaptive verifiers have since relaxed the dot-product mandate. Separate durable tension (sub-linear retrieval cost vs. scoring expressiveness) from perishable limitation (no efficient indexing for neural similarities). Cite what resolved it.
(2) Surface the strongest work from the last 6 months that contradicts the "MIPS forces dot products" claim—look for methods that index non-Euclidean similarities, learned rerankers with sublinear amortization, or hybrid recall+verification that approaches true end-to-end learning.
(3) Propose two questions that assume the regime may have shifted: (a) Can learned routing and conditional sparsity fully decouple lookup efficiency from similarity geometry? (b) Do diffusion-based or SAE-based retrieval representations enable sub-linear search over expressive, non-inner-product similarities?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

How do MIPS algorithms constrain the choice of similarity functions?

Sources 4 notes

Next inquiring lines