Can one model memorize and generalize better than two?
Does training memorization and generalization components jointly in a single model outperform training them separately and combining their predictions? This matters for building efficient recommendation systems that handle both rare and common user behaviors.
The recommender ranking problem demands two opposing capabilities. Memorization, meaning learning frequent feature co-occurrences from historical data, is served by cross-product transformations on sparse features (e.g. "user installed Netflix AND impression is Pandora"). Generalization, meaning prediction on never-seen feature combinations, is served by dense embeddings that map sparse features into a continuous space. Each fails where the other succeeds: a cross-product cannot generalize to feature pairs that never appeared in training, while dense embeddings over-generalize and produce nonzero predictions for niche items the user shouldn't see.
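The contrast between the two feature types can be made concrete with a minimal sketch. The feature names, the helper functions `cross_product` and `embedding_score`, and the hand-picked 2-dimensional embedding values are all illustrative assumptions, not taken from any real system.

```python
import numpy as np

def cross_product(active, cross):
    """Wide-style feature: 1 only if every feature in `cross` is active."""
    return int(all(f in active for f in cross))

# Hypothetical cross-product feature: "user installed Netflix AND impression is Pandora".
CROSS = ("user_installed=netflix", "impression=pandora")

# Hypothetical dense embeddings (values chosen by hand for illustration).
EMB = {
    "user_installed=netflix": np.array([0.9, 0.1]),
    "impression=pandora": np.array([0.8, 0.3]),
    "impression=spotify": np.array([0.7, 0.4]),
}

def embedding_score(user_feat, item_feat):
    """Deep-style signal: dot product of the two features' embeddings."""
    return float(EMB[user_feat] @ EMB[item_feat])

seen_pair = {"user_installed=netflix", "impression=pandora"}
unseen_pair = {"user_installed=netflix", "impression=spotify"}

# The cross-product fires on the memorized pair and is silent on the unseen one.
wide_seen = cross_product(seen_pair, CROSS)
wide_unseen = cross_product(unseen_pair, CROSS)

# The embedding still yields a nonzero score for the pair it never saw together.
deep_unseen = embedding_score("user_installed=netflix", "impression=spotify")

print(wide_seen, wide_unseen, deep_unseen)
```

The cross-product contributes nothing for the unseen pair, while the embedding path produces a score for any pair at all. Whether that score is a useful generalization or an over-generalization depends on how niche the item is.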
The Wide & Deep insight is that both halves can live in one model and be trained jointly rather than ensembled from separately trained models. The technical distinction matters: in an ensemble, each component is optimized standalone and the predictions are combined only at inference, so each component must be full-size to perform reasonably on its own. In joint training, the wide part is optimized knowing the deep part exists, so it only needs to complement the deep model's weaknesses with a small number of cross-product features. Gradients from the output are backpropagated into both branches simultaneously using mini-batch stochastic optimization.
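A rough sketch of joint training, under stated assumptions: the toy data, layer sizes, learning rate, and variable names below are invented for illustration, and plain mini-batch SGD is used for both branches for simplicity (the original Wide & Deep paper reports FTRL for the wide part and AdaGrad for the deep part). The point is only that a single loss gradient updates the wide and deep parameters in the same step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 4 dense inputs and 1 binary cross-product feature.
n = 256
x_dense = rng.normal(size=(n, 4))
x_cross = rng.integers(0, 2, size=(n, 1)).astype(float)
# Label follows a smooth dense pattern plus an exception keyed to the cross feature.
y = ((x_dense[:, :1] - x_dense[:, 1:2] + 2.0 * x_cross) > 0).astype(float)

# Deep branch: one hidden ReLU layer. Wide branch: a single linear weight.
W1 = rng.normal(scale=0.1, size=(4, 8))
b1 = np.zeros(8)
w_deep = rng.normal(scale=0.1, size=(8, 1))
w_wide = np.zeros((1, 1))
bias = np.zeros(1)

def forward(xd, xc):
    """Sum the wide and deep logits, then apply one shared sigmoid."""
    h = np.maximum(0.0, xd @ W1 + b1)
    logit = h @ w_deep + xc @ w_wide + bias
    return h, 1.0 / (1.0 + np.exp(-logit))

def log_loss(p, t):
    eps = 1e-9
    return float(-np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)))

initial_loss = log_loss(forward(x_dense, x_cross)[1], y)
w_wide_init, W1_init = w_wide.copy(), W1.copy()

lr, batch = 0.1, 32
for epoch in range(50):
    order = rng.permutation(n)
    for start in range(0, n, batch):
        idx = order[start:start + batch]
        xd, xc, t = x_dense[idx], x_cross[idx], y[idx]
        h, p = forward(xd, xc)
        # One output gradient, shared by both branches.
        d_logit = (p - t) / len(idx)
        g_w_wide = xc.T @ d_logit                # wide branch gradient
        g_w_deep = h.T @ d_logit                 # deep output-layer gradient
        d_h = (d_logit @ w_deep.T) * (h > 0)     # backprop through ReLU
        g_W1, g_b1 = xd.T @ d_h, d_h.sum(axis=0)
        # Simultaneous update of wide and deep parameters.
        w_wide -= lr * g_w_wide
        w_deep -= lr * g_w_deep
        bias -= lr * d_logit.sum(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1

final_loss = log_loss(forward(x_dense, x_cross)[1], y)
print(initial_loss, final_loss)
```

An ensemble version would instead fit the wide and deep parts with two separate losses and only add or average their predictions afterwards, so neither part's gradient would ever reflect what the other already explains.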
The result is a model where memorization and generalization are not two competing strategies forced into a weighted average but two specialized roles in one network. The wide part becomes small and surgical because the deep part handles the common cases; the deep part doesn't have to learn rare exceptions because the wide part captures them.
Inquiring lines that use this note as a source 18
This note is a source for these synthesized inquiries.
- How should memory consolidation timing differ across multiple timescales?
- How much does memorization capacity limit a model's ability to learn new information?
- How do the three grokking phases connect to memorization capacity limits?
- Can data pruning strategies exploit the finite nature of memorization capacity?
- Can a single model trained on two tasks predict untrained decision tasks?
- How does training frequency distribution shape what models reliably retrieve?
- How do layer-wise versus parameter-wise merging strategies affect information retention?
- How does distributional shift toward rare inputs change memorization reliance?
- Why do cross-product features memorize better than dense embeddings?
- How does joint backpropagation differ from training separate ensemble models?
- Why do cross-product features fail to generalize across unseen feature combinations?
- Can the joint-training principle extend beyond memorization and generalization pairs?
- How do memorization and attention map onto different memory systems?
- What distinguishes data that generalizes broadly from task-specific memorization?
- How do complementary learning systems explain the need for fast and slow consolidation?
- How can a forgetting policy preserve rare knowledge while preventing over-generalization?
- Why does semantic deduplication reduce memorization in fine-tuned models?
- Can training order and structure shape what networks retain and learn?
Related concepts in this collection 4
- Can one model handle both memorization and generalization?
Recommenders face a tradeoff between memorizing seen patterns and generalizing to new ones. Can a single architecture satisfy both needs without the cost of ensemble methods?
extends: paired statement of the same Wide & Deep result, emphasizing the parameter-efficiency angle
- Can autoencoders solve the cold-start problem in recommendations?
Explores whether deep autoencoders combining collaborative filtering with side information can overcome the cold-start problem where new users or items lack rating history.
extends: the same hybridization-via-joint-training principle generalizes beyond collaborative filtering plus content-based filtering (CF+CBF) to memorization plus generalization
- How can user vectors capture diverse interests without exploding in size?
Fixed-length user vectors compress all interests into one representation, losing information about varied tastes. Can we represent diverse interests efficiently without expanding dimensionality?
complements: DIN (Deep Interest Network) extends Wide & Deep with attention against candidates; same industrial-architecture lineage
- Why do ranking systems need to model selection bias explicitly?
Explores how training data from current rankers creates feedback loops that reinforce past decisions. Understanding this mechanism helps explain why naive approaches fail in production ranking systems.
complements: MMoE (Multi-gate Mixture-of-Experts) generalizes joint training to multi-task, multi-objective settings; same joint-vs-separate optimization lesson
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Wide & Deep Learning for Recommender Systems
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Scaling can lead to compositional generalization
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- How much do language models memorize?
- How new data permeates LLM knowledge and how to dilute it
- Nested Learning: The Illusion of Deep Learning Architecture Expanded
- Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
Original note title
wide-and-deep models combine memorization and generalization through joint training not ensembling