Calibrated Recommendations


When a user has watched, say, 70 romance movies and 30 action movies, it is reasonable to expect the personalized list of recommended movies to consist of about 70% romance and 30% action movies as well. This important property is known as calibration, and it has recently received renewed attention in the context of fairness in machine learning. In the recommended list of items, calibration ensures that the various (past) areas of interest of a user are reflected in their corresponding proportions. Calibration is especially important because recommender systems optimized toward accuracy (e.g., ranking metrics) in the usual offline setting can easily produce recommendations in which the lesser interests of a user get crowded out by the user’s main interests, which we show empirically as well as in thought experiments. Calibrated recommendations can prevent this. To this end, we outline metrics for quantifying the degree of calibration, as well as a simple yet effective re-ranking algorithm for post-processing the output of recommender systems.
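To make the 70/30 example concrete, the following is a minimal sketch of one possible calibration metric. It assumes the Kullback-Leibler divergence between the genre distribution of the user's history (`p`) and that of the recommended list (`q`), with a small smoothing weight `alpha` so that a genre missing from the list does not produce a division by zero. The function names, the equal split of an item's weight across its genres, and the default `alpha=0.01` are illustrative assumptions, not details stated in the text above.

```python
import math
from collections import Counter


def genre_distribution(items):
    """Return a genre -> probability dict for a list of items.

    Each item is given as a list of genre labels; as an assumption of this
    sketch, an item's unit weight is split equally across its genres.
    """
    weights = Counter()
    for genres in items:
        for genre in genres:
            weights[genre] += 1.0 / len(genres)
    total = sum(weights.values())
    return {genre: w / total for genre, w in weights.items()}


def calibration_kl(p, q, alpha=0.01):
    """KL divergence of the history distribution p from the list distribution q.

    q is smoothed toward p with weight alpha (an assumed value), so genres
    absent from the recommended list still get a small non-zero probability.
    A value near 0 means the list is calibrated; larger values mean less so.
    """
    divergence = 0.0
    for genre, p_g in p.items():
        if p_g == 0.0:
            continue
        q_smoothed = (1.0 - alpha) * q.get(genre, 0.0) + alpha * p_g
        divergence += p_g * math.log(p_g / q_smoothed)
    return divergence


# History from the running example: 70 romance and 30 action movies.
history = genre_distribution([["romance"]] * 70 + [["action"]] * 30)

# A 7/3 list mirrors the history; an all-romance list drops the lesser interest.
balanced_list = genre_distribution([["romance"]] * 7 + [["action"]] * 3)
romance_only_list = genre_distribution([["romance"]] * 10)

print(calibration_kl(history, balanced_list))
print(calibration_kl(history, romance_only_list))
```

Under these assumptions, the list that mirrors the history scores close to zero, while the all-romance list is penalized heavily because the action interest is missing entirely.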

Introduction. Recommender systems provide a personalized user experience in many application domains, including online shopping, social networks, and music/video streaming. In this paper, we show that recommender systems trained toward accuracy (e.g., ranking metrics) can easily generate lists of recommended items that focus on the main areas of interest of a user, while the user’s lesser areas of interest tend to be underrepresented or even absent. Over time, such unbalanced recommendations carry the risk of gradually narrowing down the user’s areas of interest, similar to the effect of echo chambers or filter bubbles. The same problem arises when several users share one account: the interests of the less active users may get crowded out in the recommendations. We demonstrate this effect in several thought experiments in Section 2 as well as in experiments on real-world data in Section 6.

Discussion / Conclusion. In this paper, we showed that recommender systems that are trained toward accuracy in the typical offline setting may generate unbalanced recommendations, especially when the available training data are limited or noisy. We motivated the importance of calibration as an additional objective besides recommendation accuracy. We outlined established metrics for quantifying the degree of calibration. Such metrics should be particularly sensitive to discrepancies in the lesser areas of interest of a user, especially when such an area of interest is completely missing from the recommended list. Moreover, we presented a simple yet effective greedy algorithm, and outlined an optimality guarantee that follows from the submodularity of the objective. These approaches can be applied for post-processing the recommendation lists generated by recommender systems. We also discussed how calibration differs from diversity in its typical sense of minimal similarity or redundancy among the recommended items.
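The greedy post-processing step can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes the objective trades off the summed relevance scores of the list against a smoothed KL-divergence calibration penalty, weighted by a hypothetical parameter `lam`. The item ids, scores, function names, and the smoothing weight `alpha` are all made up for the example.

```python
import math


def genre_distribution(items, genres_of):
    """Genre distribution of a list; each item's weight is split over its genres."""
    weights = {}
    for item in items:
        genres = genres_of[item]
        for genre in genres:
            weights[genre] = weights.get(genre, 0.0) + 1.0 / len(genres)
    total = sum(weights.values())
    return {genre: w / total for genre, w in weights.items()}


def calibration_kl(p, q, alpha=0.01):
    """Smoothed KL divergence of target distribution p from list distribution q."""
    return sum(
        p_g * math.log(p_g / ((1.0 - alpha) * q.get(genre, 0.0) + alpha * p_g))
        for genre, p_g in p.items()
        if p_g > 0.0
    )


def greedy_rerank(scores, genres_of, p, n, lam=0.5):
    """Build a list of n items by adding, one at a time, the item that
    maximizes (1 - lam) * total relevance - lam * calibration penalty.

    lam = 0 reduces to ranking by score; lam close to 1 favors calibration.
    """
    chosen = []
    remaining = set(scores)
    while remaining and len(chosen) < n:

        def objective(item):
            trial = chosen + [item]
            relevance = sum(scores[i] for i in trial)
            penalty = calibration_kl(p, genre_distribution(trial, genres_of))
            return (1.0 - lam) * relevance - lam * penalty

        # sorted() makes tie-breaking deterministic across runs.
        best = max(sorted(remaining), key=objective)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical candidates: romance items score higher than action items,
# so ranking by score alone crowds out the user's lesser interest.
scores = {f"r{i}": 0.90 - 0.01 * i for i in range(10)}
scores.update({f"a{i}": 0.50 - 0.01 * i for i in range(5)})
genres_of = {item: ["romance"] if item.startswith("r") else ["action"] for item in scores}
target = {"romance": 0.7, "action": 0.3}

by_score = greedy_rerank(scores, genres_of, target, n=10, lam=0.0)
calibrated = greedy_rerank(scores, genres_of, target, n=10, lam=0.99)
print(by_score)
print(calibrated)
```

With `lam=0.0` the list contains only the higher-scoring romance items, whereas a large `lam` pulls action items into the list until its genre proportions approach the target. The sketch recomputes the objective from scratch for every candidate, which keeps it short but is not tuned for large candidate sets.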