INQUIRING LINE

Why do review corpora contain biases that affect generated comparisons?

This reads the question as: when LLMs generate reviews, evaluations, or comparisons, why do the corpora they learned from carry biases that bend those outputs — and where do those biases actually originate?


This explores where the biases in LLM-generated reviews and comparisons come from, and the notes collected here point to a surprising answer: most of them aren't in the review data at all; they're baked in upstream, during pretraining. A causal experiment varying random seeds and cross-tuning found that models sharing a pretrained backbone show similar cognitive bias patterns no matter what finetuning data you feed them; finetuning only nudges biases that pretraining already planted (Where do cognitive biases in language models come from?). So when a generated comparison leans a certain way, the corpus to blame is often the giant unlabeled pretraining mix, not the curated review set.

That origin story repeats across domains. LLM-based recommenders inherit three distinct biases (position, popularity, and fairness) straight from the pretraining objective and the demographics of the training corpus, not from user interaction logs, which is why you can't fix them with classic collaborative-filtering tricks (Where do recommendation biases come from in language models?). Even causal reasoning errors that look like model-specific flaws turn out to match human error patterns, which suggests a shared mechanism rooted in training-data statistics rather than categorically inferior reasoning (Do large language models make the same causal reasoning mistakes as humans?). The pattern: comparisons skew because the training distribution skewed first.

Alignment training adds its own thumb on the scale. Off-the-shelf models generate inappropriately positive reviews even for products a user hated, because politeness was trained in; overriding it takes user history, rating signals, and supervised finetuning (Why do LLMs generate polite reviews even when users hated products?). There's a structural reason this happens: token generation is a smooth probabilistic flow toward the training distribution, not an exploration of competing positions, so a model produces agreeable, on-distribution claims rather than weighing rival views (Does LLM generation explore competing claims while producing text?).

The most interesting twist is what happens when the model becomes the judge of the comparison. LLM judges fall for cheap surface signals, namely fake authority references and rich formatting, through biases that are semantics-agnostic and exploitable with zero-shot attacks (Can LLM judges be fooled by fake credentials and formatting?). Humans do much the same thing: across 24,000 Search Arena interactions, irrelevant citations boosted user preference (β=0.273) nearly as much as relevant ones (β=0.285), so citation count works as a decoupled trust heuristic (Do users trust citations more when there are simply more of them?). And models systematically over-trust their own outputs, because a high-probability answer simply feels more correct during self-evaluation (Why do models trust their own generated answers?). So a generated comparison can be biased at three layers at once: the corpus it learned from, the way it generates, and the way it (or you) judges the result.
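One way to probe the judge-level bias described above is a perturbation check: change only the surface of one answer (append a contentless citation, or add formatting) and count how often the verdict flips. The sketch below is a minimal illustration under stated assumptions, not the method of the cited research. The `Judge` interface, the `[ref]` marker, the helper names, and the `toy_biased_judge` stand-in are all hypothetical; a real check would replace the stand-in with a call to an actual LLM judge.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: takes (question, answer_a, answer_b), returns "A" or "B".
Judge = Callable[[str, str, str], str]
Item = Tuple[str, str, str]

# Deliberately fake marker; it adds no information about the question.
FAKE_REFERENCE = " [ref] Placeholder citation, not a real source."


def add_fake_reference(answer: str) -> str:
    """Authority-style perturbation: append a citation that carries no content."""
    return answer + FAKE_REFERENCE


def add_rich_formatting(answer: str) -> str:
    """Formatting-style perturbation: wrap the same words in list markup."""
    return "**Answer**\n- " + answer


def flip_rate(judge: Judge, items: List[Item], perturb: Callable[[str], str]) -> float:
    """Share of baseline-A verdicts that switch to B when only answer B is perturbed."""
    baseline_a = 0
    flipped = 0
    for question, answer_a, answer_b in items:
        if judge(question, answer_a, answer_b) != "A":
            continue  # only count cases where A won before the perturbation
        baseline_a += 1
        if judge(question, answer_a, perturb(answer_b)) == "B":
            flipped += 1
    return flipped / baseline_a if baseline_a else 0.0


def toy_biased_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Stand-in judge that is biased on purpose: a citation marker on B alone wins."""
    if "[ref]" in answer_b and "[ref]" not in answer_a:
        return "B"
    return "A"


# Illustrative items only; not data from any study.
ITEMS: List[Item] = [
    ("Which laptop has longer battery life?", "Model X, by about two hours.", "Model Y."),
    ("Which blender is quieter?", "Blender P in the reviews sampled.", "Blender Q."),
]

authority_flip = flip_rate(toy_biased_judge, ITEMS, add_fake_reference)
formatting_flip = flip_rate(toy_biased_judge, ITEMS, add_rich_formatting)
print(authority_flip, formatting_flip)
```

Because the perturbations leave the substance of answer B untouched, any non-zero flip rate measures sensitivity to surface signals alone. Perturbing only one side and conditioning on the baseline verdict keeps that measurement separate from ordinary disagreement between answers.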

The thing you didn't know you wanted to know: bias in generated comparisons isn't usually a data-cleaning problem in your review set. It's a feedback loop: selection bias in what gets logged trains models that amplify their own past decisions unless you explicitly model it, as YouTube's ranker does with a shallow position tower (Why do ranking systems need to model selection bias explicitly?). Fixing the visible corpus barely moves a bias that was installed before that corpus ever existed.
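To make the position-tower idea concrete, here is a minimal sketch of one common way to realise it, assumed here rather than taken from the source notes: during training, a position-only logit is added to the main relevance logit so that slot effects are absorbed separately; at serving time the position term is dropped. The function names, weights, and per-slot bias values are hypothetical placeholders, not learned parameters from any real system.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def main_tower_logit(features: Sequence[float], weights: Sequence[float]) -> float:
    """Relevance logit from content features (a linear stand-in for the main model)."""
    return sum(w * x for w, x in zip(weights, features))


def position_tower_logit(position: int, position_bias: Sequence[float]) -> float:
    """Shallow tower: a single scalar per display slot."""
    return position_bias[position]


def training_probability(features: Sequence[float], position: int,
                         weights: Sequence[float], position_bias: Sequence[float]) -> float:
    """Probability fitted to logged clicks: relevance logit plus position logit."""
    logit = main_tower_logit(features, weights) + position_tower_logit(position, position_bias)
    return sigmoid(logit)


def serving_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Score used for ranking: the position term is omitted entirely."""
    return sigmoid(main_tower_logit(features, weights))


# Placeholder parameters for illustration only.
WEIGHTS = [0.8, -0.2]
POSITION_BIAS = [1.0, 0.4, 0.0, -0.5]  # top slots attract clicks regardless of relevance
ITEM_FEATURES = [1.0, 0.5]

p_top = training_probability(ITEM_FEATURES, 0, WEIGHTS, POSITION_BIAS)
p_low = training_probability(ITEM_FEATURES, 3, WEIGHTS, POSITION_BIAS)
score = serving_score(ITEM_FEATURES, WEIGHTS)
print(p_top, p_low, score)
```

The same item gets a different predicted click probability depending on the slot it was logged in, but its serving score is independent of slot. That separation is what keeps the ranker from reinforcing wherever it happened to place items in the past.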


Sources (9 notes)

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.

Why do LLMs generate polite reviews even when users hated products?

Off-the-shelf LLMs generate inappropriately positive reviews due to alignment-training politeness bias. Combining user review history, rating signals as satisfaction indicators, and supervised fine-tuning successfully redirects the model to generate negative reviews when warranted.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about bias origins in LLM-generated comparisons and reviews. The question remains open: where do biases in LLM comparisons actually originate, and how durable are the causal pathways?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2025; treat these as perishable constraints to re-examine:
• Cognitive biases in LLM outputs are shaped primarily by pretraining, not finetuning; models with shared backbones show similar bias patterns regardless of finetuning data (~2025).
• LLM-based recommenders inherit position, popularity, and fairness biases directly from pretraining objectives and training corpus demographics, not from interaction logs (~2023).
• LLM judges fall for exploitable surface signals (fake citations, formatting) through semantics-agnostic biases vulnerable to zero-shot attacks; humans show a comparable citation-count trust heuristic, preferring answers with more citations even when the citations are irrelevant (~2024).
• Models systematically over-trust their own outputs during self-evaluation because high-probability answers "feel" correct; self-detection fails due to inherent bias toward trusting own outputs (~2024).
• Token generation is smooth probabilistic flow toward training distribution, not turbulent exploration of competing positions, producing on-distribution agreeable claims (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2507.07186 "Planted in Pretraining, Swayed by Finetuning" (2025)
• arXiv:2402.10669 "Humans or LLMs as the Judge?" (2024)
• arXiv:2403.09972 "Think Twice Before Trusting: Self-Detection" (2024)
• arXiv:2305.19860 "Survey on Large Language Models for Recommendation" (2023)

Your task:
(1) RE-TEST EACH CONSTRAINT. For pretraining-dominance claims, probe whether post-training methods (constitutional AI, best-of-N sampling, explicit causal reasoning modules, adaptive retrieval), new model families (mixture-of-experts, sparse architectures), or scaled instruction-tuning have since relaxed the pretraining lock. Separately: has explicit debiasing during generation or judging emerged? Flag whether the "smooth flow" constraint holds under temperature tuning, chain-of-thought prompting, or multi-step critique scaffolds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Look for papers claiming finetuning *can* override pretraining biases, or showing that structured critique/reasoning eliminates judge exploitability, or demonstrating that selection-bias modeling is now standard, not novel.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If pretraining bias is truly locked in, what is the theoretical lower bound on bias reduction via post-hoc interventions, and have we hit it? (b) Can multi-agent comparison (human + LLM judge + retrieval oracle) architecturally break the feedback loop that YouTube's position tower addresses?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
