Why does dot product beat MLP-based similarity in practice?
Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
Neural Collaborative Filtering popularized replacing the dot product between user and item embeddings with a learned MLP, on the theory that an MLP — a universal function approximator — should subsume the dot product as a special case. Rendle and colleagues revisit the experiments and show two non-obvious results.
First, with proper hyperparameter tuning, the simple dot product substantially outperforms the MLP-based similarity. The original NCF gain came from undertuning the dot-product baseline, not from MLP expressiveness. Second, even though an MLP can in theory approximate any function, learning a dot product with an MLP requires both a large model and a large training set — the inductive bias of MLPs makes the dot-product structure expensive to recover from data.
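One way to see why the dot-product structure is costly for an MLP to recover: the dot product is bilinear, so scaling both embeddings by 2 scales the score by 4, whereas a ReLU MLP over the concatenated embeddings is piecewise linear in its input. The sketch below is illustrative only. It assumes a toy, untrained MLP with random weights and zero biases, and the names `dot_similarity`, `mlp_similarity`, and `random_layers` are hypothetical, not code from either paper.

```python
import random

def dot_similarity(p, q):
    """Inner product of a user embedding p and an item embedding q."""
    return sum(pf * qf for pf, qf in zip(p, q))

def mlp_similarity(p, q, layers):
    """Score the concatenation [p; q] with a ReLU MLP.

    `layers` is a list of (weights, biases); weights is a list of rows.
    Hidden layers use ReLU; the final layer is linear with one output.
    """
    h = list(p) + list(q)
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h[0]

def random_layers(sizes, rng):
    """Random weights and zero biases (assumption: an untrained toy network)."""
    return [
        ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

rng = random.Random(0)
p, q = [1.0, 2.0], [3.0, 4.0]
layers = random_layers([4, 8, 1], rng)

# Double both embeddings and compare how each score responds.
p2, q2 = [2 * x for x in p], [2 * x for x in q]
print(dot_similarity(p, q), dot_similarity(p2, q2))  # 11.0 44.0
print(mlp_similarity(p, q, layers), mlp_similarity(p2, q2, layers))
```

With zero biases the MLP score exactly doubles when the inputs double, while the dot product quadruples. A trained MLP with biases is less rigid, but it still has to assemble the multiplicative interaction from piecewise-linear pieces, which is consistent with the note's point that recovering a dot product takes a large model and a lot of data.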
The practical bite is in inference. Dot products admit Maximum Inner Product Search (MIPS) algorithms, typically approximate, that retrieve the top-K items in sublinear time over millions of items. An MLP similarity requires a forward pass per (user, item) pair, which is intractable at production scale. The paper concludes that MLPs as embedding combiners should be "used with care". The point is reinforced by the fact that the most competitive modern DNN architectures in NLP (transformers) and vision (ResNets) all use dot products in their output layers. Universal approximation does not make an operator a universally good choice; its inductive bias interacts with data scale and serving constraints.
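As a rough illustration of why the dot product is friendly to retrieval, the sketch below uses a known reduction (not taken from this note) that turns maximum inner product search into Euclidean nearest-neighbour search: append one extra coordinate to each item so that all items share the same norm. Both searches here are brute-force scans standing in for a real index, the data is random, and every function name is hypothetical.

```python
import heapq
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot(user, items, k):
    """Exact (brute-force) maximum inner product search."""
    return heapq.nlargest(k, range(len(items)),
                          key=lambda i: dot(user, items[i]))

def augment_items(items):
    """Append sqrt(M^2 - ||q||^2) so every item has the same norm M."""
    max_sq = max(dot(q, q) for q in items)
    return [q + [math.sqrt(max(0.0, max_sq - dot(q, q)))] for q in items]

def top_k_by_distance(user, augmented_items, k):
    """Nearest neighbours of [user; 0] in the augmented space.

    ||[u; 0] - q'||^2 = ||u||^2 + M^2 - 2 <u, q>, so the smallest
    distance corresponds to the largest inner product.
    """
    u = user + [0.0]

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(u, augmented_items[i]))

    return heapq.nsmallest(k, range(len(augmented_items)), key=sq_dist)

rng = random.Random(0)
items = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
user = [rng.gauss(0, 1) for _ in range(8)]

by_dot = top_k_by_dot(user, items, 5)
by_dist = top_k_by_distance(user, augment_items(items), 5)
print(by_dot == by_dist)
```

Because the ranking by inner product matches a ranking by distance, item embeddings can be precomputed once and handed to a nearest-neighbour index. A learned MLP scorer does not, in general, expose comparable algebraic structure, so each candidate pair needs its own forward pass.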
Inquiring lines that use this note as a source (10)
This note is a source for these synthesized inquiries.
- How does the zero-diagonal constraint enable generalization in collaborative filtering?
- What makes dot product efficient for real-time retrieval over millions of items?
- How do MIPS algorithms constrain the choice of similarity functions?
- Can simpler collaborative filtering models outperform deep architectures?
- What attentional bias objectives compete with dot product similarity for associative memory?
- Why do transductive recommenders fail where inductive learning succeeds?
- Can models retrieve the right tool without relying on vector similarity?
- Why do cross-product features fail to generalize across unseen feature combinations?
- Why is a combinatorial framework better than family resemblance classification?
- How should practitioners measure similarity between embeddings safely?
Related concepts in this collection (4)
- Can MLPs learn to match dot product similarity in practice?
  Universal approximation theory suggests MLPs should learn any similarity function, including dot product. But does this theoretical promise hold up when training on real, finite datasets with practical constraints?
  extends: paired statement of the same Rendle result emphasizing the practical infeasibility of efficient retrieval
- Can simpler models beat deep networks for recommendation systems?
  Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
  complements: same lesson at the architecture level, where the right structural constraint beats depth
- Can a linear model beat deep collaborative filtering?
  Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
  complements: same anti-depth lesson; anti-affinity and dot-product priors both outperform learned alternatives
- Can one model memorize and generalize better than two?
  Does training memorization and generalization components jointly in a single model outperform training them separately and combining their predictions? This matters for building efficient recommendation systems that handle both rare and common user behaviors.
  complements: industrial systems use simple structural priors (wide cross-product) for memorization rather than relying on MLP universality
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Neural Collaborative Filtering vs. Matrix Factorization Revisited
- Curse of “Low” Dimensionality in Recommender Systems
- KAN: Kolmogorov-Arnold Networks
- Deep Interest Network for Click-Through Rate Prediction
- Titans: Learning to Memorize at Test Time
- On the Theoretical Limitations of Embedding-Based Retrieval
- Embarrassingly Shallow Autoencoders for Sparse Data
- Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
Original note title
MLP-based similarity underperforms dot product despite being a universal function approximator — inductive bias matters more than capacity