How does pretraining corpus popularity bias affect LLM recommendation behavior?
This explores how an LLM's recommendations skew toward items that were popular in its pretraining text — not items popular in the actual dataset it's deployed on — and what that means for using LLMs as recommenders.
The sharpest finding here is that popularity bias in LLM recommenders doesn't come from the interaction data you feed them; it is baked in during pretraining. GPT-4, for instance, keeps recommending The Shawshank Redemption across datasets with very different popularity distributions, because that title is over-represented in the text it learned from (see "Where does LLM recommendation bias actually come from?"). This is a domain-shift problem: the model is recommending the world's popular items, not your catalog's popular items, and standard debiasing methods built for collaborative filtering don't touch it.
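One way to make the domain shift concrete is to compare which items dominate the model's recommendations with which items dominate the target dataset's interactions. The sketch below is only an illustration: the function names, the Jaccard-overlap measure, and the toy data are all assumptions, not a method taken from the cited notes.

```python
from collections import Counter


def top_k_items(item_ids, k):
    """Return the k most frequent item ids in a list of item ids."""
    return [item for item, _ in Counter(item_ids).most_common(k)]


def popularity_overlap(recommended, interactions, k=3):
    """Jaccard overlap between the top-k recommended and top-k catalog items.

    Values near 0 suggest the recommender's favorites are not the
    catalog's favorites (hypothetical diagnostic, not a standard metric).
    """
    rec_top = set(top_k_items(recommended, k))
    cat_top = set(top_k_items(interactions, k))
    return len(rec_top & cat_top) / len(rec_top | cat_top)


def top_item_share(recommended):
    """Fraction of all recommendations that go to the single most recommended item."""
    counts = Counter(recommended)
    return counts.most_common(1)[0][1] / len(recommended)


# Toy data (made up): the catalog's users mostly interact with a, b, c,
# while the recommender keeps returning one globally famous title.
interactions = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["shawshank"] * 1
recommended = ["shawshank"] * 6 + ["a"] * 2 + ["d"] * 1

overlap = popularity_overlap(recommended, interactions, k=3)
share = top_item_share(recommended)
```

In this toy case only one of the top three recommended items is also among the catalog's top three, and a single title absorbs two thirds of all recommendations, which is the kind of pattern the Shawshank example describes.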
That single failure is part of a broader pattern. LLM recommenders inherit a family of biases from pretraining (position bias, popularity bias, and fairness bias) that stem from the language model's objective and the demographics of its corpus rather than from any user-interaction signal (see "Where do recommendation biases come from in language models?"). This isn't unique to recommendation: causal experiments show that cognitive biases in general are planted during pretraining and only nudged by finetuning. Models sharing a pretrained backbone show similar bias fingerprints no matter what instruction data you tune them on (see "Where do cognitive biases in language models come from?"). So if you're hoping to finetune the popularity skew away, the evidence says you're working at the wrong layer.
The interesting turn is what the corpus suggests doing about it. One school of thought says: stop asking the LLM to rank at all. LLMs are strong at understanding content but are not specialized rankers, and their direct recommendations carry the pretraining skew described above. Using them to enrich item descriptions (paraphrases, summaries, categories) and feeding that text to a traditional recommender beats letting the LLM recommend directly (see "Does LLM input augmentation beat direct LLM recommendation?"). The LLM's text understanding is the asset; its predictions are the liability.
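The division of labor in that first approach can be sketched in a few lines. Everything here is a stand-in chosen for illustration: `fake_llm` replaces a real model call with canned text, the prompt wording is invented, and the "traditional recommender" is reduced to bag-of-words cosine similarity rather than whatever model the cited work used.

```python
import math
from collections import Counter


def enrich(description, llm_generate):
    """Append LLM-written text (summary, categories) to the original description."""
    extra = llm_generate(f"Summarize and categorize this item: {description}")
    return f"{description} {extra}"


def bow(text):
    """Bag-of-words term counts for a lowercase, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(liked_text, catalog, k=1):
    """Rank catalog items by text similarity to what the user liked.

    The LLM never ranks anything; it only supplied the item text.
    """
    liked = bow(liked_text)
    ranked = sorted(
        catalog, key=lambda item_id: cosine(liked, bow(catalog[item_id])), reverse=True
    )
    return ranked[:k]


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call, returning canned category words."""
    canned = {"prison": "drama hope friendship", "space": "science fiction adventure"}
    return " ".join(text for key, text in canned.items() if key in prompt)


raw = {"m1": "two men in prison", "m2": "crew lost in space"}
enriched = {item_id: enrich(text, fake_llm) for item_id, text in raw.items()}

before = recommend("science fiction adventure", raw)
after = recommend("science fiction adventure", enriched)
```

With the raw descriptions neither item shares a word with the user's taste, so the ranking is arbitrary; after enrichment the content-based ranker can match the space item. The ranking step stays with a model that has no pretraining favorites of its own.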
A second school closes the loop with reinforcement learning. Instead of trusting the model's pretrained priors, you train it directly against recommendation metrics like NDCG and Recall as black-box rewards, which pulls behavior toward the actual target catalog rather than the corpus's celebrity items (see "Can recommendation metrics train language models directly?"). Strikingly, models trained this way learn implicit catalog awareness: they generate effective product queries without ever seeing the inventory, much as a person searches a store without knowing its full stock (see "Can LLMs recommend products without ever seeing the catalog?").
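To show what "metrics as rewards" means in practice, here is a minimal sketch of Recall@k and binary-relevance NDCG@k computed over a retrieved list. It is not Rec-R1's implementation: the function names, the choice of binary relevance, and the use of NDCG alone as the reward are assumptions made for illustration.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reward(ranked, relevant, k=10):
    """Scalar reward for one generated query, scored only by what the retriever returned.

    The policy never sees the catalog; it only receives this number.
    """
    return ndcg_at_k(ranked, relevant, k)


# Toy example: the retriever returned four items for a generated query,
# and the user's held-out relevant items are "a" and "b".
ranked = ["x", "a", "y", "b"]
relevant = {"a", "b"}
r = reward(ranked, relevant, k=4)
```

Because the reward depends only on the retriever's output for the target catalog, a policy optimized against it is scored on that catalog's items rather than on whatever was famous in the pretraining text.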
Worth knowing as a kicker: popularity bias rarely travels alone. The same pretraining-origin story produces persuasion biases (LLMs lean on logical, quantitative framing in nearly every exchange, lending recommendations unearned authority; see "Do LLMs persuade users more often than humans do?") and citation-trust effects (users prefer answers with more citations even when those citations are irrelevant; see "Do users trust citations more when there are simply more of them?"). A popularity-biased recommendation delivered with confident, well-cited prose is doubly hard for a user to push back on, because the bias and the persuasiveness reinforce each other.
Sources (8 notes)
GPT-4 concentrates recommendations on items popular in its pretraining corpus rather than in target datasets. The Shawshank Redemption dominates across different datasets even when they have different popularity distributions, revealing a domain-shift effect that standard debiasing methods cannot address.
Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Using LLMs to augment item descriptions with paraphrases, summaries, and categories, then feeding the enriched text to traditional recommenders, beats asking LLMs to recommend directly. The mechanism: LLMs excel at content understanding but lack the specialized inductive bias that ranking requires, so their textual enrichment is more valuable than their predictions.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.
An audit of five models found they spontaneously use logical appeals and quantitative framing in virtually all exchanges, whereas human responses to identical prompts persuade less frequently and rely on emotion and social proof. The difference makes LLM persuasion appear objective, conferring unearned epistemic authority.
Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.