Recommender Systems

Research on building and improving systems that suggest content, products, or information to users. Covers neural architectures, conversational interfaces, LLM-based approaches, and personalization methods for learning and modeling user preferences.

75 notes (primary) · 92 papers · 4 sub-topics

Recommender Architectures

34 notes

Can simpler models beat deep networks for recommendation systems?

Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.

Can a linear model beat deep collaborative filtering?

Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.

Do LLM explanations faithfully describe their recommendation process?

When LLMs recommend items to groups, do their explanations match how they actually made the choice? This matters because users trust explanations to understand AI decision-making.

Can we distill LLM knowledge into graphs for real-time recommendations?

E-commerce needs sub-millisecond recommendations, but LLMs are too slow. Can we extract LLM insights offline into a knowledge graph that serves requests in production without sacrificing quality or explainability?

Can MLPs learn to match dot product similarity in practice?

Universal approximation theory suggests MLPs should learn any similarity function, including dot product. But does this theoretical promise hold up when training on real, finite datasets with practical constraints?

Why does Netflix use multiple ranking systems instead of one?

Netflix's homepage combines five distinct rankers optimizing different signals and time horizons. The question explores whether a single unified ranker could serve all user intents or if architectural separation is necessary.

What does Netflix need to optimize in those first 90 seconds?

Streaming users abandon a session after 60-90 seconds, having reviewed only 1-2 screens. Does the recommender problem lie in predicting ratings accurately, or in making those limited screens immediately compelling?

Can reinforcement learning align summarization with ranking goals?

Generic LLM summaries optimize for readability, not ranking performance. Can training summarizers with downstream relevance scores as rewards fix this misalignment and produce summaries that actually help rankers match queries?

Can graph structure patterns outperform direct edge signals in noisy data?

When user-behavior data is messy and unreliable, does looking at structural patterns across multiple edges produce better product recommendations than counting simple co-occurrences? This matters because e-commerce platforms need robust substitute graphs at billion-scale.

Do accuracy-optimized recommendations preserve user interest diversity?

Standard recommender systems rank by predicted relevance, which tends to saturate lists with the highest-confidence items. Does this approach naturally preserve the proportions of a user's multiple interests, or does it systematically crowd out smaller ones?

Why do accuracy-optimized recommenders crowd out minority interests?

Explores why recommendation models that maximize accuracy systematically over-represent a user's dominant interests while suppressing their lesser ones, even when both are measurable and real.

Can discrete codes transfer better than text embeddings?

Does inserting a discrete quantization layer between text and item representations improve cross-domain transfer in recommenders? This explores whether decoupling text from final embeddings reduces domain gap and text bias.

Can smaller models outperform their LLM teachers with enough data?

Explores whether student models trained on expanded teacher-generated labels can exceed teacher performance in production ranking tasks, and what data scale makes this possible.

Can model isolation solve streaming recommendation better than replay?

When user data arrives continuously, does isolating parameters per task give better control over forgetting old patterns while learning new ones than experience replay or knowledge distillation?

Why do hash collisions hurt recommendation models so much?

Explores whether standard low-collision hashing works for embedding tables in recommenders, given that user and item frequencies follow power-law distributions rather than uniform ones.

When can greedy bandits skip exploration entirely?

Under what conditions does natural randomness in incoming contexts eliminate the need for active exploration in contextual bandits? This matters for high-stakes domains like medicine where exploration carries real costs.

How can user vectors capture diverse interests without exploding in size?

Fixed-length user vectors compress all interests into one representation, losing information about varied tastes. Can we represent diverse interests efficiently without expanding dimensionality?

Can autoencoders solve the cold-start problem in recommendations?

Explores whether deep autoencoders combining collaborative filtering with side information can overcome the cold-start problem where new users or items lack rating history.

Can implicit feedback reveal both preference and confidence?

When users take implicit actions like purchases or watches, do those signals carry two separable pieces of information: what they prefer and how certain we should be? Explicit ratings can't make that distinction.

Can graphs unify collaborative filtering and side information?

How might merging user-item interactions with item attributes into a single graph structure allow recommendation systems to capture collaborative and attribute-based signals together, rather than separately?

Why do ranking systems need to model selection bias explicitly?

Explores how training data from current rankers creates feedback loops that reinforce past decisions. Understanding this mechanism helps explain why naive approaches fail in production ranking systems.

Why does multinomial likelihood work better for click prediction?

Explores whether the choice of likelihood function—multinomial versus Gaussian or logistic—affects recommendation performance, and what structural properties make one better suited to modeling user clicks.

Why does multinomial likelihood work better for ranking recommendations?

Explores whether the choice of likelihood function in VAE-based collaborative filtering matters for matching training objectives to ranking evaluation metrics, and why items should compete for probability mass.

How can real-time recommendations stay responsive and reproducible?

In-session signals improve ranking accuracy, but requiring fresh data during sessions forces real-time computation. This creates latency, network sensitivity, and debugging challenges that offset the relevance gains.

Do hash collisions really harm popular recommendation items?

Hash-based embedding tables assume uniform ID distribution, but real recommender systems show heavy-tailed frequency patterns. The question explores whether collisions actually concentrate damage on the high-traffic entities that matter most.

Why does collaborative filtering struggle with sparse user data?

Collaborative filtering datasets appear massive but hide a fundamental challenge: each user has rated only a tiny fraction of items. How does this per-user sparsity shape the modeling problem, and what techniques can overcome it?

Can neural networks explore efficiently at recommendation scale?

Exploration—discovering unknown user preferences—normally requires expensive posterior uncertainty estimates. Can a neural architecture make Thompson sampling practical for real-world recommenders without prohibitive computational cost?

Why do recommendation systems miss recurring user preference patterns?

Most streaming recommendation systems treat preference changes as one-time drift events and discard old patterns. But user behavior often cycles—coffee shops on weekday mornings, gyms on weekends. How should systems account for these recurring periodicities instead of detecting and resetting against them?

Why do global concept drift methods fail for recommender systems?

In recommender systems, each user is an individual whose preferences shift at distinct, asynchronous times. Can standard concept-drift approaches designed for population-level changes capture this per-user heterogeneity?

Can discretizing text embeddings improve recommendation transfer?

Does inserting a quantization step between text encodings and item representations reduce the recommender's over-reliance on text similarity and enable better cross-domain transfer?

Why do recommendation models fail when new users arrive?

Most recommendation algorithms are built assuming all users and items exist at training time. But real platforms constantly see new users and items. Can models be redesigned to handle unseen entities as a structural requirement?

Why do academic recommenders fail when deployed in production?

Academic recommendation models assume static test sets known at training time, but real platforms continuously receive new users, items, and interactions. Understanding this gap reveals what production systems actually need.

Can modeling multiple user personas improve recommendation accuracy?

Single-vector user representations compress all tastes into one place, potentially crowding out minority interests. Can representing users as multiple weighted personas adapt better to what's being scored and produce more accurate predictions?

Can attention mechanisms reveal which user taste explains each recommendation?

Single-vector user models collapse diverse tastes into one representation, losing expressiveness. Can weighting multiple personas by item relevance surface the right taste at the right time while making recommendations traceable?

Conversational Recommenders

13 notes

Does conversation order matter for recommending items in dialogue?

Conversational recommendation systems typically ignore the sequence in which items are mentioned, treating dialogue as a bag of entities. But does the order itself carry predictive signal about what to recommend next?

Can unified policy learning improve conversational recommender systems?

This explores whether formulating attribute-asking, item-recommending, and timing decisions as a single reinforcement learning policy outperforms treating them as separate components. The question matters because joint optimization could improve conversation quality and system scalability.

Can conversational recommenders recover lost preference signals from history?

Conversational recommenders abandoned item and user similarity signals when they shifted to dialogue-focused design. Can integrating historical sessions and look-alike users restore these channels without losing dialogue benefits?

Where does LLM recommendation bias actually come from?

Do conversational AI systems inherit popularity bias from their training data or from the datasets they're deployed on? Understanding the source matters for knowing how to fix it.

Do LLMs in conversational recommendation systems use collaborative or content knowledge?

Conversational recommenders powered by LLMs might rely on either collaborative signals (user interaction patterns) or content/context knowledge (semantic understanding). Understanding which signal dominates would reveal how to design and deploy these systems effectively.

Can LLMs recommend products without ever seeing the catalog?

Explores whether language models can learn to generate effective search queries for recommendation systems without direct access to inventory data. This challenges the intuition that good recommendations require knowing what items exist.

Why do queries and their causes seem semantically different?

Information retrieval systems find passages matching query language, but what if the segment that actually caused a user's question says something quite different? This explores when semantic similarity fails to find causal relevance.

Can language models bridge the gap between critique and preference?

When users express what they dislike rather than what they want, can LLMs reliably transform those critiques into positive preferences that retrieval systems can actually use?

How should LLM-based recommenders retrieve from massive item corpora?

When conversational recommenders need to search millions of items, the LLM cannot memorize the corpus. What retrieval strategies work best under different constraints, and how do they trade off latency, sample efficiency, and scalability?

Can recommendation metrics train language models directly?

Explores whether LLMs can be optimized through closed-loop reinforcement learning using real recommendation system outputs as rewards, rather than relying on expensive proprietary model distillation.

Do conversational recommender benchmarks actually measure recommendation skill?

Conversational recommender systems are evaluated against ground-truth items mentioned later in conversations. But does this metric distinguish between genuinely recommending new items versus simply repeating items users already discussed?

Can review sentiment alignment fix sparse CRS dialogue?

Conversational recommender systems struggle with brief dialogues that lack item-specific detail. Can retrieving reviews that match user sentiment polarity enrich both dialogue context and response generation?

Do recommendation strategies beyond preference questions work better?

What role do sociable conversational moves—opinion sharing, encouragement, credibility signals—play in successful human recommendations, compared to simply asking what someone likes?

Personalized Recommenders

10 notes

Does LLM input augmentation beat direct LLM recommendation?

Are LLMs more effective at enriching item descriptions than at making recommendations directly? This explores whether specialized models work better when LLMs focus on what they do best: content understanding rather than ranking.

Does preference data need more raters than examples?

Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?

Can aggregate reward models satisfy genuinely disagreeing users?

When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?

Can bandit algorithms beat collaborative filtering for news?

News recommendation faces constant content churn and cold-start users—settings where traditional collaborative filtering struggles. Can a contextual bandit approach like LinUCB explicitly balance exploration and exploitation better than static methods?

Can retrieval enhancement fix explainable recommendations for sparse users?

When users have few historical interactions, embedded recommendation models struggle to generate personalized explanations. Can augmenting sparse histories with retrieved relevant reviews—selected by aspect—overcome this fundamental data limitation?

Can cross-user behavior reveal news relations that individual histories miss?

When a single user's reading history is too sparse for personalized recommendations, can patterns from many users' collective clicking behavior expose hidden connections between articles that no individual user alone could discover?

What dominates AI compute in production systems today?

While public discussion centers on large language models, Facebook's infrastructure data reveals a different story about which AI workloads actually consume the most compute cycles in real production environments.

Can users steer recommendations with natural language at inference?

Can recommendation systems let users specify their preferences in natural language at inference time without retraining? This matters because it would let both new and existing users dynamically adjust what they want to see.

Can one text encoder unify all recommendation tasks?

Does framing diverse recommendation problems—from sequential prediction to review generation—as natural language tasks allow a single model to learn shared structure? Can this approach generalize to unseen items and new task phrasings?

Can friends with different tastes improve recommendations?

Does incorporating social networks through friends' diverse preferences rather than similar tastes lead to better recommendations? This challenges conventional homophily-based approaches that assume friends like the same things.

LLM-Based Recommenders

5 notes

Can LLMs gain collaborative filtering strength without losing text understanding?

LLM recommenders excel at cold-start through text semantics but struggle with warm interactions where collaborative patterns matter most. Can external collaborative models be integrated into LLM reasoning to close this gap?

Why do language models ignore temporal order in ranking?

When LLMs rank items based on interaction history, do they actually use sequence order or treat it as a set? Understanding this gap matters for building effective LLM-based recommenders.

Can LLMs explain recommenders by mimicking their internal states?

Can training language models to align with both a recommender's outputs and its internal embeddings produce explanations that are both faithful and human-readable? This explores whether dual-access interpretation solves the fundamental tension between behavioral accuracy and interpretability.

Do comparisons help users evaluate items better than isolated descriptions?

Can framing product evaluations relationally—by comparing to other items—ground assessment in user reasoning better than absolute descriptions? This matters because recommendation explanations often ask users to do comparison work mentally.

Can item identifiers balance uniqueness and semantic meaning?

Should LLM-based recommenders prioritize distinctive item references or semantic understanding? This explores whether a hybrid approach can overcome the tradeoffs forced by pure ID or pure text indexing.