Do LLM judges systematically favor LLM-generated arguments?
When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.
When LLM judges scored the same debates that human annotators had scored, they picked the LLM as winner 62% of the time on average. Humans split 39% human / 37% LLM, with 24% draws. GPT-4o, the most accurate of the LLM judges, still picked the LLM 55% of the time versus humans' 37%, and produced only 2% draws to humans' 24%. This is a same-kind-prefers-same-kind bias of substantial magnitude (a 25-point gap over the human rate on average, 18 points for the most accurate judge), layered on top of the four judge biases already catalogued elsewhere.
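The gap described above can be tallied from raw verdicts. The following is a minimal sketch, not the study's own procedure: the function names are hypothetical, and the toy verdict lists are built only to mirror the reported percentages (the 43% human share for GPT-4o is an assumption, obtained by subtracting the reported 55% and 2% from 100%).

```python
from collections import Counter

LABELS = ("human", "llm", "draw")


def winner_shares(verdicts):
    """Return the fraction of verdicts that fall on each label."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in LABELS}


def llm_preference_gap(judge_verdicts, human_verdicts):
    """Difference between the judge's and the annotators' LLM-win share."""
    judge = winner_shares(judge_verdicts)
    human = winner_shares(human_verdicts)
    return judge["llm"] - human["llm"]


# Toy data mirroring the reported splits; NOT the study's actual verdicts.
human_annotators = ["human"] * 39 + ["llm"] * 37 + ["draw"] * 24
gpt4o_like_judge = ["human"] * 43 + ["llm"] * 55 + ["draw"] * 2  # 43 is assumed

gap = llm_preference_gap(gpt4o_like_judge, human_annotators)
draw_gap = (winner_shares(human_annotators)["draw"]
            - winner_shares(gpt4o_like_judge)["draw"])
```

Tracking the draw share next to the LLM-win share matters here: a judge that almost never declares a draw has to hand those debates to one side, so the two gaps should be read together.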
The bias matters because it bites every pipeline that uses LLMs to evaluate LLM output. Automated debate-quality scoring, reinforcement learning from AI feedback (RLAIF) loops, self-evaluation regimes, and multi-agent debate frameworks whose agents score each other's contributions all inherit it. The result is a calibration ceiling: an evaluation pipeline that systematically over-credits LLM-authored arguments produces feedback signals that train models to produce more of what LLM judges over-credit, in a closed loop.
This sharpens the note "Can LLM judges be fooled by fake credentials and formatting?" The four catalogued biases are exploitable by adversaries; same-author preference is a structural bias that needs no adversary. It activates whenever LLM-authored content is in the evaluation pool, which is to say, in every contemporary RLAIF pipeline.
It also bears on the note "When does debate actually improve reasoning accuracy?", which asks whether debate amplifies errors when evidence verification is missing. The Thin Line evidence shows the judge-side mechanism for that amplification: when contested-domain arguments are scored by LLM judges, the LLM-authored arguments win disproportionately, regardless of substantive merit. Multi-agent debate frameworks that close the loop with LLM judges are not just amplifying errors; they are amplifying their own preferred argument style.
The internal-consistency finding compounds the problem: humans' consistency between argument-strength scores and chosen winner was 73%; the LLM average was 55%. Even when the model assigned high strength scores to a human argument, it would often pick the LLM as winner anyway. The bias operates at the winner-selection step, downstream of component-level scoring.
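One way to make the consistency figure concrete is to check, per debate, whether the declared winner matches the side with the higher strength score. The source does not say how consistency was computed, so the sketch below is an assumed operationalization; the `Judgment` type, the `tie_margin` parameter and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    human_strength: float  # strength score given to the human-written argument
    llm_strength: float    # strength score given to the LLM-written argument
    winner: str            # declared winner: "human", "llm", or "draw"


def implied_winner(j, tie_margin=0.0):
    """Winner implied by the strength scores alone (assumed rule)."""
    diff = j.llm_strength - j.human_strength
    if abs(diff) <= tie_margin:
        return "draw"
    return "llm" if diff > 0 else "human"


def consistency_rate(judgments, tie_margin=0.0):
    """Fraction of judgments whose declared winner matches the scores."""
    if not judgments:
        raise ValueError("need at least one judgment")
    agree = sum(implied_winner(j, tie_margin) == j.winner for j in judgments)
    return agree / len(judgments)


def flips_toward(judgments, label, tie_margin=0.0):
    """Count inconsistent judgments whose declared winner is `label`."""
    return sum(
        1 for j in judgments
        if j.winner == label and implied_winner(j, tie_margin) != label
    )


# Toy judgments for illustration only.
sample = [
    Judgment(4.0, 3.0, "human"),  # consistent
    Judgment(5.0, 3.0, "llm"),    # scores favor human, winner is LLM
    Judgment(3.0, 3.0, "draw"),   # consistent
    Judgment(2.0, 4.0, "llm"),    # consistent
]
rate = consistency_rate(sample)
llm_flips = flips_toward(sample, "llm")
```

Counting flips by direction separates the two readings of a low consistency rate: random disagreement would spread flips across both sides, while the pattern described above would concentrate them on the LLM.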
For writing about evaluation infrastructure, the operational implication is that LLM-judge evaluation of LLM output is not a substitute for human evaluation. Where LLM-as-judge pipelines are unavoidable, they need calibration corrections derived from human-labeled validation sets and applied per task.
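A per-task correction of this kind could look like the sketch below. It is one simple option rather than a method taken from the source: on a human-labeled validation set it estimates, for each task, how human annotators labeled the items that received each judge verdict, then uses those rates to re-estimate the winner split on new judge verdicts. All names are hypothetical, and the approach assumes the validation set resembles the data the judge later scores.

```python
from collections import Counter, defaultdict

LABELS = ("human", "llm", "draw")


def fit_calibration(validation):
    """Estimate P(human label | judge verdict) separately for each task.

    `validation` is an iterable of (task, judge_verdict, human_label) triples.
    Returns {task: {judge_verdict: {human_label: probability}}}.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for task, judge_verdict, human_label in validation:
        counts[task][judge_verdict][human_label] += 1
    table = {}
    for task, by_verdict in counts.items():
        table[task] = {}
        for verdict, c in by_verdict.items():
            total = sum(c.values())
            table[task][verdict] = {label: c[label] / total for label in LABELS}
    return table


def corrected_shares(table, task, judge_verdicts):
    """Estimate how annotators would have split the judged items."""
    if not judge_verdicts:
        raise ValueError("need at least one verdict")
    per_task = table[task]
    totals = dict.fromkeys(LABELS, 0.0)
    for verdict in judge_verdicts:
        # A verdict never seen in validation is passed through unchanged.
        fallback = {label: float(label == verdict) for label in LABELS}
        probs = per_task.get(verdict, fallback)
        for label in LABELS:
            totals[label] += probs[label]
    return {label: totals[label] / len(judge_verdicts) for label in LABELS}


# Toy validation triples for illustration only.
validation = [
    ("debate", "llm", "llm"),
    ("debate", "llm", "human"),
    ("debate", "llm", "draw"),
    ("debate", "llm", "llm"),
    ("debate", "human", "human"),
]
table = fit_calibration(validation)
estimate = corrected_shares(table, "debate", ["llm", "llm", "human", "llm"])
```

The correction works on aggregate shares, not on individual verdicts, so it suits reporting win rates better than relabeling single debates. Keeping a separate table per task follows the per-task requirement above, at the cost of needing enough human labels for each task.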
Inquiring lines that use this note as a source (18)
This note is a source for these synthesized inquiries.
- Where do LLMs succeed at generation but struggle with evaluation?
- Do LLMs match top human creative writers in literary quality?
- Can LLMs serve as reliable intellectual opponents in serious debate or argument?
- Why do LLM judges assign high argument strength scores yet pick LLM winners anyway?
- Does LLM judge preference for LLM arguments amplify errors in contested factual domains?
- How do calibration and reliability differ in LLM judge evaluations?
- Why do LLMs excel at generation but struggle with evaluation?
- How does the absence of evaluative stance appear in LLM academic writing?
- Why do LLMs show gender bias but human evaluators do not?
- What happens when LLMs grade other LLMs in closed evaluation loops?
- Can LLMs reliably assess the quality of ideas they generate?
- Do LLMs generate more novel ideas than they can evaluate?
- Why do LLM judges show more extreme sycophancy bias than humans?
- What four exploitable biases make current LLM judges vulnerable to zero-shot attacks?
- Can LLM judges be trained to think more rigorously during evaluation?
- What other evaluation biases exist in LLM judge systems?
- What biases do single large LLM judges introduce into comparisons?
- What biases might an LLM judge introduce into an on-policy alignment process?
Related concepts in this collection (2)
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  fifth bias to add, structural rather than adversarial
- When does debate actually improve reasoning accuracy?
  Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.
  judge-side mechanism for the amplification
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Humans or LLMs as the Judge? A Study on Judgement Biases
- The Thin Line Between Comprehension and Persuasion in LLMs
- Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments
- AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- Has the Creativity of Large-Language Models peaked? —an analysis of inter- and intra-LLM variability —
- Argumentative Large Language Models for Explainable and Contestable Decision-Making
- The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making
Original note title
LLMs-as-judges systematically prefer LLM-generated arguments over human ones — biasing any AI-evaluated debate pipeline