Can recommender systems separate true preference from individual rating style bias?
This explores whether a recommender can tell apart what a user actually likes from the quirks of how they personally use the rating scale (one person's 3 stars is another's 5) — and the corpus says the noise runs deeper than rating style alone.
The most direct answer in the collection is sobering: the same user rates the same item differently from one session to the next, sometimes by multiple stars. "Why do the same users rate items differently each time?" found that explicit ratings mix three things: genuine preference, rater-specific habits (your personal scale), and anchoring effects from whatever you rated just before. So a rating isn't a clean reading of preference; it's preference plus behavior. The unsettling implication is that there's a ceiling on how well any system can recover "true" taste from explicit stars, because the signal itself is partly noise.
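The ceiling can be made concrete with a toy simulation. This is a minimal sketch, not the method of the cited note: it assumes a rating is latent preference plus a constant personal offset plus independent Gaussian session noise, and every name and number in it (`true_pref`, `user_offset`, `noise_sd`) is hypothetical. Ratings are left unclipped so the arithmetic stays simple.

```python
import numpy as np

def simulate_ratings(true_pref, user_offset, noise_sd, rng):
    """One rating session: preference + personal scale offset + session noise."""
    noise = rng.normal(0.0, noise_sd, size=true_pref.shape)
    return true_pref + user_offset + noise

def rmse(a, b):
    """Root mean squared difference between two rating vectors."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
true_pref = rng.uniform(1.0, 5.0, size=5000)  # hypothetical latent taste per item
user_offset = 0.5                             # hypothetical "generous rater" shift
noise_sd = 0.7                                # hypothetical session-to-session noise

session_a = simulate_ratings(true_pref, user_offset, noise_sd, rng)
session_b = simulate_ratings(true_pref, user_offset, noise_sd, rng)

# An oracle that knows both preference and offset still misses by about noise_sd.
oracle_rmse = rmse(true_pref + user_offset, session_a)

# The same user re-rating the same items disagrees by about sqrt(2) * noise_sd.
retest_rmse = rmse(session_a, session_b)

print(f"oracle RMSE:  {oracle_rmse:.2f}")
print(f"re-rate RMSE: {retest_rmse:.2f}")
```

Under these assumptions the personal offset is fully recoverable, but the session noise is not: no model can score better against a single session than the oracle does.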
Given that, much of the corpus quietly routes around the problem rather than trying to subtract the bias out. Instead of cleaning ratings, systems reframe the user. "Can modeling multiple user personas improve recommendation accuracy?" and "Can attention mechanisms reveal which user taste explains each recommendation?" argue that a person isn't one taste vector at all but several personas, weighted differently depending on the item in front of them, which means "true preference" was never a single stable thing to isolate in the first place. Other notes lean on signals that sidestep self-reported scores: "Can simpler models beat deep networks for recommendation systems?" shows a shallow item-item model beating deep networks by learning which items go together, and "Why does multinomial likelihood work better for ranking recommendations?" gets state-of-the-art results by treating recommendation as a competition for ranking position rather than as the prediction of an absolute rating value. Both effectively care about relative preference, where your personal scale offset cancels out.
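The claim that a personal scale offset cancels under ranking can be checked directly. The sketch below assumes the rating style is a constant additive shift, which is a simplification; the scores and the helper `rank_order` are hypothetical.

```python
import numpy as np

def rank_order(scores):
    """Item indices sorted from most to least preferred."""
    return np.argsort(-np.asarray(scores), kind="stable").tolist()

base = np.array([3.0, 4.5, 2.0, 4.0])  # hypothetical scores for four items
harsh = base - 1.0                      # same taste, stricter personal scale
generous = base + 0.5                   # same taste, more generous scale

# Absolute ratings differ, so a rating-prediction loss sees three different users.
# The ordering is identical, so a ranking objective sees one.
assert rank_order(base) == rank_order(harsh) == rank_order(generous)

# Subtracting each user's own mean removes a constant offset as well.
centered = [r - r.mean() for r in (base, harsh, generous)]
assert np.allclose(centered[0], centered[1])
assert np.allclose(centered[0], centered[2])

print(rank_order(base))  # [1, 3, 0, 2]
```

Ranking is unchanged by any strictly increasing rescaling of a user's scores, not only by a constant shift. Neither ranking nor mean-centering removes session-to-session noise, which can still reorder items that are close together.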
The deeper twist is that bias in recommenders doesn't only come from how individuals rate; it also gets baked in by the system's own machinery. "Does embedding dimensionality secretly drive popularity bias in recommenders?" shows that an architectural choice (an embedding dimension that is too small) silently pushes the model toward popular items, and "Why do accuracy-optimized recommenders crowd out minority interests?" shows that simply optimizing for accuracy crowds out a user's minority interests. "Where do recommendation biases come from in language models?" adds that language-model recommenders inherit position, popularity, and fairness biases from pretraining that have nothing to do with the user at all. So even if you perfectly separated rating style from true preference, the pipeline would reintroduce distortion downstream.
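To illustrate how accuracy alone crowds out a minority interest, and how a post-processing step can restore it, here is a minimal sketch of a greedy calibrated reranker. It is one common way to formulate the idea, not necessarily the algorithm in the cited note. The catalog, scores, genre labels, and the parameters `lam` and `alpha` are all hypothetical.

```python
import math
from collections import Counter

def genre_dist(items, genre_of):
    """Share of each genre in a list of items."""
    counts = Counter(genre_of[i] for i in items)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def kl_divergence(p, q, alpha=0.01):
    """KL(p || q~), with q smoothed toward p so missing genres stay finite."""
    total = 0.0
    for g, p_g in p.items():
        q_g = (1 - alpha) * q.get(g, 0.0) + alpha * p_g
        total += p_g * math.log(p_g / q_g)
    return total

def calibrated_rerank(scores, genre_of, target, k, lam=0.5):
    """Greedily pick k items, trading relevance against distance from target."""
    chosen, candidates = [], set(scores)
    while len(chosen) < k and candidates:
        def objective(item):
            trial = chosen + [item]
            relevance = sum(scores[i] for i in trial)
            miscalibration = kl_divergence(target, genre_dist(trial, genre_of))
            return (1 - lam) * relevance - lam * miscalibration
        best = max(sorted(candidates), key=objective)  # sorted for determinism
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Hypothetical user: 70% action, 30% documentary in their history.
target = {"action": 0.7, "doc": 0.3}
scores = {"a1": 0.9, "a2": 0.85, "a3": 0.8, "a4": 0.75, "d1": 0.6, "d2": 0.55}
genre_of = {i: ("action" if i.startswith("a") else "doc") for i in scores}

top_by_score = sorted(scores, key=scores.get, reverse=True)[:3]
calibrated = calibrated_rerank(scores, genre_of, target, k=3)

print(top_by_score)  # ['a1', 'a2', 'a3']
print(calibrated)    # ['a1', 'd1', 'a2']
```

The score-only list is all action, while the reranked list gives the documentary interest one of three slots without retraining the model that produced the scores.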
There's also a more radical reframing worth knowing: ratings aren't a fixed property of a person waiting to be decoded. "Do different recommender types shape opinion convergence differently?" finds that the recommender itself shapes what people end up rating and how, and "Can friends with different tastes improve recommendations?" finds that the most useful signal isn't taste similarity at all but friends with *different* tastes nudging you toward anomalous choices. If preference is partly produced by the system rather than merely measured by it, then "separating true preference from rating bias" is the wrong frame, and there may be no bias-free ground truth underneath.
The honest synthesis: the corpus doesn't offer a method that cleanly subtracts rating-style bias from true preference. What it offers instead are escape routes — model the user as plural, rank rather than score, and treat preference as something dynamic and partly system-shaped rather than a stable signal buried under noise.
Sources (10 notes)
Amatriain et al. found that the same user gives substantially different ratings to the same item across sessions, shifting by multiple stars. This noise stems from temporal inconsistency, rater-specific biases, and anchoring effects—making ratings reflect both preference and rating-behavior rather than stable preference alone.
AMP-CF separates user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.
AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.
EASE, a shallow linear item-item weight matrix with diagonal constrained to zero, beats deep neural baselines on most datasets. The constraint forces generalization by forbidding self-prediction, while learned negative weights capture item dissimilarity—a structural prior more valuable than model capacity.
Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.
Research shows that when user/item embedding dimensions are too small, recommender systems overfit toward popular items to maximize ranking quality. This compounds over time as niche items receive insufficient exposure, and cannot be fixed post-hoc without treating dimensionality as a fairness hyperparameter.
Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.
Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.
Research shows that frequently-bought-together and co-viewed recommendation networks produce different opinion convergence patterns. The mechanism: each recommender type attracts different audience segments with different prior expectations, shaping both who sees products together and how they rate them.
Social Poisson Factorization uses friends' diverse tastes to recommend items outside users' usual preferences, outperforming methods that pull friends' representations together. Networks add value through influence on anomalous choices, not taste similarity.
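The EASE entry in the sources above describes a shallow item-item weight matrix with a zero diagonal. A minimal sketch of its closed-form fit follows, assuming a small dense user-item matrix; the toy data and the regularisation value `lam` are hypothetical.

```python
import numpy as np

def fit_ease(X, lam=1.0):
    """Closed-form item-item weights with a zero diagonal (EASE-style)."""
    G = X.T @ X                       # item-item co-occurrence (Gram) matrix
    G = G + lam * np.eye(G.shape[0])  # L2 regularisation on the diagonal
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))             # B[i, j] = -P[i, j] / P[j, j]
    np.fill_diagonal(B, 0.0)          # forbid an item from predicting itself
    return B

# Hypothetical implicit feedback: rows are users, columns are items.
# Items 0 and 1 always appear together; item 2 never appears with them.
X = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

B = fit_ease(X, lam=1.0)
scores = X @ B  # predicted affinity of every user for every item

print(np.round(B, 2))
```

In this toy data the weight between items 0 and 1 comes out at 0.75 and the weights involving item 2 are zero. No rating scale enters the fit at all, only which items occur together.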