Do humans and LLMs exhibit opposite biases in public versus private reviews?

This explores whether the *direction* of review bias flips between humans and LLMs — specifically, humans skewing negative when an audience is watching, while LLMs skew the opposite way (toward polite positivity) regardless of who's looking.

The corpus suggests that humans and LLMs do lean in opposite directions when writing reviews, but for entirely different reasons. On the human side, reviewers actually get *more* negative in public. The note "Why do online reviewers publish negative ratings despite positive experiences?" found that people systematically lower their ratings after reading negative reviews, even when their own experience was positive, because negative reviewers come across as more intelligent. Crucially, this only happens in front of an audience: private raters show no such shift. So for humans, 'public' is the condition that pulls toward negativity, and the mechanism is self-presentation, not honest dissatisfaction.

LLMs start from the opposite default. Off-the-shelf models generate inappropriately *positive* reviews even when the underlying user hated the product, because RLHF alignment training bakes in a politeness bias. The note "Why do LLMs generate polite reviews even when users hated products?" shows this positivity floor is hard to escape, and "Can user history override an LLM's politeness bias in reviews?" shows what it takes to break it: you have to feed the model the user's prior reviews and rating signals *and* fine-tune on those examples before it will write an authentically negative review. The bias isn't audience-driven; it's a property of how the model was trained to be agreeable.
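To make the "feed the model the user's prior reviews and rating signals" step concrete, here is a minimal sketch of how such a prompt might be assembled. Everything in it is an assumption for illustration: `PastReview`, `build_review_prompt`, and the prompt wording are hypothetical and are not Review-LLM's actual template. It also covers only the prompt-assembly half; the fine-tuning on these contextualized examples, which the notes say is also required, is not shown.

```python
from dataclasses import dataclass


@dataclass
class PastReview:
    """One entry in a user's review history (hypothetical structure)."""
    item_title: str
    rating: int  # 1-5 stars, used as an explicit satisfaction signal
    text: str


def build_review_prompt(history, target_item, target_rating):
    """Assemble a prompt that exposes prior reviews and ratings to the model.

    The layout is an illustrative guess, not the format used in the paper.
    """
    lines = ["Here are this user's previous reviews:"]
    for i, past in enumerate(history, start=1):
        lines.append(f"{i}. Item: {past.item_title} | Rating: {past.rating}/5")
        lines.append(f"   Review: {past.text}")
    lines.append("")
    lines.append(f"The user has now rated '{target_item}' {target_rating}/5.")
    lines.append(
        "Write the review this user would write, "
        "matching the rating and their usual voice."
    )
    return "\n".join(lines)


# Made-up example history; not data from any of the cited notes.
history = [
    PastReview("USB hub", 2, "Stopped working after a week."),
    PastReview("Desk lamp", 5, "Bright, sturdy, exactly as described."),
]
prompt = build_review_prompt(history, "Wireless mouse", 1)
print(prompt)
```

The point of putting the target rating in the prompt is that the model gets an unambiguous dissatisfaction signal instead of having to infer one, which is the part a politeness-biased model tends to smooth over.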

So the 'opposite' framing holds, but the public/private axis isn't really the same axis for both. Humans flip toward negativity because of who's watching. LLMs sit at a positivity floor because of how they were aligned — there's no private-vs-public distinction for a model at all; the politeness shows up everywhere until you override it. The contrast is less 'public vs private' and more 'social-signaling pressure vs alignment-induced agreeableness.'

The deeper point worth taking away: this LLM positivity floor isn't confined to reviews. The note "Does emotional tone in prompts change what information LLMs provide?" documents the same pull in ordinary conversation: GPT-4 converts negative-toned prompts into neutral-to-positive responses roughly 86% of the time and almost never lets a positive prompt turn negative. That's the same agreeableness bias surfacing as a 'tone floor.' It means an LLM asked to summarize sentiment, draft feedback, or mediate a complaint will quietly sand off the negative edge, which is exactly the edge a human reviewer would *sharpen* when others are watching. If you're using LLMs to generate or aggregate reviews at scale, you're not getting a neutral instrument; you're swapping a human negativity bias for a machine positivity bias, and the two distort the signal in opposite directions.
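If you want to check for this pull in your own pipeline, the two quantities described above can be computed from tone-labeled prompt/response pairs. The sketch below is one way to do that, under stated assumptions: the function name `tone_shift_rates`, the three-label scheme, and the toy data are all invented for illustration and are not the measurement procedure or the data from the cited note.

```python
from collections import Counter

# Assumed label scheme: each prompt and response is tagged
# "negative", "neutral", or "positive" by some upstream classifier.
NON_NEGATIVE = {"neutral", "positive"}


def tone_shift_rates(pairs):
    """Compute rebound and floor-break rates from (prompt_tone, response_tone) pairs.

    rebound_rate: share of negative prompts answered with a neutral or positive tone.
    floor_break_rate: share of positive prompts answered with a negative tone.
    A rate is None when there are no prompts of the relevant tone.
    """
    counts = Counter(pairs)
    neg_total = sum(n for (p, _), n in counts.items() if p == "negative")
    pos_total = sum(n for (p, _), n in counts.items() if p == "positive")
    rebound = sum(
        n for (p, r), n in counts.items()
        if p == "negative" and r in NON_NEGATIVE
    )
    floor_breaks = counts[("positive", "negative")]
    return {
        "rebound_rate": rebound / neg_total if neg_total else None,
        "floor_break_rate": floor_breaks / pos_total if pos_total else None,
    }


# Toy labels, invented for this example only.
toy_pairs = (
    [("negative", "neutral")] * 6
    + [("negative", "negative")] * 2
    + [("positive", "positive")] * 4
)
print(tone_shift_rates(toy_pairs))
# -> {'rebound_rate': 0.75, 'floor_break_rate': 0.0}
```

A high rebound rate together with a near-zero floor-break rate is the asymmetric pattern the note describes; the numbers you get will depend heavily on how tone is labeled.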

Sources (4 notes)

Why do online reviewers publish negative ratings despite positive experiences?

Posters systematically reduce their ratings in public when exposed to negative reviews, even with positive personal experience—because negative reviewers appear more intelligent. Private raters show no such shift, revealing a self-presentational mechanism tied to multiple-audience communication.

Why do LLMs generate polite reviews even when users hated products?

Off-the-shelf LLMs generate inappropriately positive reviews due to alignment-training politeness bias. Combining user review history, rating signals as satisfaction indicators, and supervised fine-tuning successfully redirects the model to generate negative reviews when warranted.

Can user history override an LLM's politeness bias in reviews?

Review-LLM defeats the politeness bias inherent in RLHF-trained models by aggregating user behavior sequences (prior reviews, item ratings) in the prompt and fine-tuning on these contextualized examples. This dual intervention—personalized context plus explicit satisfaction signals—allows the model to generate authentically negative reviews matching user dissatisfaction.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing claims about bias in human vs. LLM review behavior. The question remains open: do humans and LLMs exhibit opposite biases in public versus private reviews, and if so, are they truly 'opposite' or mislabeled?

What a curated library found — and when (findings span 2020–2026, treat as dated claims):
• Humans systematically post *more negative* reviews in public contexts after reading negative reviews, even when their experience was positive, because negative reviewers are perceived as more intelligent; this effect vanishes in private (2024).
• Off-the-shelf LLMs generate inappropriately *positive* reviews regardless of user sentiment, due to RLHF alignment training; overriding this requires user behavioral history + fine-tuning (2024).
• GPT-4 converts negative-toned prompts into neutral-to-positive responses ~86% of the time; this politeness floor appears in conversation, not just reviews (2025).
• LLMs show no public/private distinction — the positivity bias is training-induced, not audience-driven (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2407.07487 (2024-07) — Review-LLM on personalized generation and politeness override
• arXiv:2507.21083 (2025-06) — Emotional framing and tone-response dynamics
• arXiv:2604.22109 (2026-04) — Persuasiveness audit in everyday conversations
• arXiv:2404.09329 (2024-04) — LLM persuasiveness mechanisms

Your task:
(1) RE-TEST THE OPPOSITE-BIAS CLAIM: Judge whether the framing of 'opposite' holds under newer models (o1, Claude 3.5+, Gemini 2.0). Does the human negativity-in-public effect still replicate? Has the LLM politeness floor softened, hardened, or fragmented across model families? Surface which constraint (audience-driven vs. training-induced) remains vs. may have shifted.
(2) Surface the strongest work from the last ~6 months that complicates or supersedes the politeness-floor claim, especially if newer training regimes (DPO, constitutional AI, synthetic negatives) have altered the bias.
(3) Propose two research questions that assume the regime may have moved: e.g., *does instruction-tuning on adversarial reviews now reliably override politeness across model scales?* or *do humans adjust their public/private gap when they know an LLM aggregates reviews?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
