Choosing the Right Weights: Balancing Value, Strategy, and Noise in Recommender Systems
Many recommender systems are based on optimizing a linear weighting of different user behaviors, such as clicks, likes, shares, etc. Though the choice of weights can have a significant impact, there is little formal study or guidance on how to choose them. We analyze the optimal choice of weights from the perspectives of both users and content producers who strategically respond to the weights. We consider three aspects of user behavior: value-faithfulness (how well a behavior indicates whether the user values the content), strategy-robustness (how hard it is for producers to manipulate the behavior), and noisiness (how much estimation error there is in predicting the behavior). Our theoretical results show that for users, upweighting more value-faithful and less noisy behaviors leads to higher utility, while for producers, upweighting more value-faithful and strategy-robust behaviors leads to higher welfare (and the impact of noise is non-monotonic). Finally, we discuss how our results can help system designers select weights in practice.
Introduction. Most widely used recommender systems are based on prediction and optimization of multiple behavioral signals. For example, a video platform may predict whether a user will click on a video, how long they will watch it, and whether or not they will give it a thumbs-up. These predictions then need to be aggregated into a final score by which items are ranked for a user. Typically, the aggregation is done through a linear combination of the different signals. For example, leaked documents from TikTok [Smith, 2021] described the objective for ranking as $P(\text{like}) \cdot w_{\text{like}} + P(\text{comment}) \cdot w_{\text{comment}} + \mathbb{E}[\text{playtime}] \cdot w_{\text{playtime}} + P(\text{play}) \cdot w_{\text{play}}$. Twitter also recently open-sourced the exact weights on the ten behaviors they use for ranking [Twitter, 2023]. Unfortunately, the chosen weights can often lead to unintended consequences. For example, when Facebook introduced emoji reactions, they gave all emoji reactions a weight five times that of the standard thumbs-up.
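The linear aggregation described above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the behavior names follow the TikTok example, but the weight values, the predicted probabilities, and the helper names `score` and `rank` are hypothetical and do not come from any platform.

```python
from typing import Dict, List, Tuple

# Hypothetical weights, chosen only for illustration (not any platform's real values).
WEIGHTS: Dict[str, float] = {"like": 1.0, "comment": 2.0, "playtime": 0.1, "play": 0.5}


def score(predictions: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear aggregation: sum over behaviors of weight * predicted value."""
    return sum(w * predictions.get(behavior, 0.0) for behavior, w in weights.items())


def rank(items: Dict[str, Dict[str, float]],
         weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (item_id, score) pairs sorted from highest to lowest score."""
    scored = [(item_id, score(preds, weights)) for item_id, preds in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical model predictions for two candidate videos.
candidates = {
    "video_a": {"like": 0.10, "comment": 0.01, "playtime": 12.0, "play": 0.6},
    "video_b": {"like": 0.02, "comment": 0.08, "playtime": 5.0, "play": 0.4},
}

ranking = rank(candidates, WEIGHTS)

# Raising the weight on comments tenfold flips the order of the two videos.
comment_heavy = dict(WEIGHTS, comment=20.0)
ranking_comment_heavy = rank(candidates, comment_heavy)
```

With the first set of weights `video_a` is ranked first; with the comment-heavy weights `video_b` overtakes it, even though the underlying predictions are unchanged. This is the sense in which the choice of weights alone can change what users see.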
Discussion / Conclusion. We analyzed how three aspects of behavior (value-faithfulness, noisiness, and strategy-robustness) affect the optimal weight vector and welfare for users and producers. In practice, the weight vector is chosen based on both performance in A/B tests and qualitative human judgment [Twitter, 2023, Merrill and Oremus, 2021]. Understanding how these three aspects of behavior affect user and producer welfare could help platform designers (typically the product and engineering teams) narrow the search space to the weights most relevant to test, since a full grid search may become prohibitively expensive as the number of behaviors grows. Thinking through these aspects may also help foresee certain negative side effects, especially those arising from strategic adaptation, which often takes longer to measure than the typical duration of an A/B test. For instance, on Twitter, producers sometimes post offensive or sensationalized messages that receive few likes but many replies and quote tweets expressing disagreement (a phenomenon called “ratioing” [Troy, 2022, Minot et al., 2021]).
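To illustrate why a full grid search over weights becomes expensive, the following sketch counts the configurations in a grid. It assumes each weight is tried at the same small set of candidate values; the function names and the candidate values are hypothetical and serve only to show the exponential growth.

```python
from itertools import product
from typing import Dict, Iterator, Sequence


def grid_size(num_behaviors: int, values_per_weight: int) -> int:
    """Number of weight vectors in a full grid: values_per_weight ** num_behaviors."""
    return values_per_weight ** num_behaviors


def weight_grid(behaviors: Sequence[str],
                candidate_values: Sequence[float]) -> Iterator[Dict[str, float]]:
    """Lazily enumerate every weight vector in the grid."""
    for combo in product(candidate_values, repeat=len(behaviors)):
        yield dict(zip(behaviors, combo))


# Two behaviors with three candidate values each give 3 ** 2 = 9 configurations.
small_grid = list(weight_grid(["like", "reply"], [0.5, 1.0, 2.0]))

# Ten behaviors with five candidate values each give 5 ** 10 configurations.
large_grid_count = grid_size(10, 5)
```

With ten behaviors and only five candidate values per weight, the grid already contains 9,765,625 weight vectors, far more than could be evaluated if each one needed its own A/B test arm.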