How does same-author bias interact with the four adversarial judge biases already documented?
This note looks at how same-author bias (also called self-preference bias: a judge favoring text produced by itself or its own model family) sits alongside the four exploitable judge biases the corpus actually documents: authority, beauty, position, and verbosity. Worth saying plainly up front: the collection thoroughly catalogs those four, but it has no note specifically on same-author bias, so what follows is a lateral read of where such a bias would fit and why it may be harder to fix than the others.
The four documented biases are all *surface-feature* exploits. Judges score responses higher when they carry fake citations (authority) or rich formatting (beauty), and these two are 'semantics-agnostic': they work without touching content quality and can be triggered in zero-shot attacks that require no model access at all [Can LLM judges be fooled by fake credentials and formatting?] [Can LLM judges be tricked without accessing their internals?]. Position and verbosity round out the set [Can reasoning during evaluation reduce judgment bias in LLM judges?]. The common thread is that the judge is reacting to a *signal on the surface of the text*. Same-author bias is a different animal: the trigger isn't any single visible feature of the response, it's the response's provenance, meaning stylistic fingerprints the judge recognizes as 'mine.' That makes it less a fifth item on the same list and more a bias of a different kind.
That distinction matters for mitigation. The corpus's main defense, training judges to reason through evaluations rather than pattern-match, substantially reduces susceptibility to authority, verbosity, position, and beauty precisely *because* those are surface cues a reasoning step can second-guess [Can reasoning during evaluation reduce judgment bias in LLM judges?]. Same-author bias may resist that fix: if the preference operates as a familiarity prior rather than an explicit feature, reasoning about the visible text won't surface it. The more relevant tool is causal. Counterfactual invariance forces a model to hold its judgment constant when an irrelevant variable changes, and it already eliminates four *reward-model* biases (length, sycophancy, concept, discrimination) by isolating actual quality from spurious correlates [Can counterfactual invariance eliminate reward hacking biases?]. Authorship is exactly that kind of spurious correlate, so invariance to 'who wrote this' is the natural framing for the problem.
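One way to make 'invariance to who wrote this' concrete is a diagnostic that scores the same text twice, changing only the claimed author, and measures the shift. The sketch below is illustrative only: the `Judge` signature, the `family-A`/`family-B` labels, and both toy judges are hypothetical stand-ins, not anything the corpus describes. It probes label-based preference only, not stylistic fingerprints, and it is a measurement, not the counterfactual-invariance training method from the cited note.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge signature: (response_text, claimed_author) -> score.
Judge = Callable[[str, str], float]


def authorship_gap(judge: Judge, responses: Sequence[str],
                   author_a: str, author_b: str) -> float:
    """Mean score shift when only the claimed author changes.

    A judge that is invariant to authorship should return 0.0.
    """
    return mean(judge(r, author_a) - judge(r, author_b) for r in responses)


def toy_biased_judge(text: str, author: str) -> float:
    # Toy stand-in, not a real model: "quality" is word count capped at 5,
    # plus a fixed bonus when the label matches the judge's own family.
    quality = min(len(text.split()), 5)
    bonus = 1.0 if author == "family-A" else 0.0
    return quality + bonus


def toy_invariant_judge(text: str, author: str) -> float:
    # Same toy quality measure, but the author label is ignored.
    return float(min(len(text.split()), 5))


responses = ["short answer", "a somewhat longer answer with detail"]
biased_gap = authorship_gap(toy_biased_judge, responses, "family-A", "family-B")
invariant_gap = authorship_gap(toy_invariant_judge, responses, "family-A", "family-B")
```

Here `biased_gap` comes out at 1.0 and `invariant_gap` at 0.0 by construction. With a real judge, a nonzero gap on label swaps would flag explicit self-preference, while a stylistic version of the test would need style-preserving or style-removing rewrites of each response as the counterfactual.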
There's also a question of where the bias is planted. A causal study found that cognitive biases in LLMs are largely set during pretraining and only modulated by finetuning [Where do cognitive biases in language models come from?]. If self-preference rides on stylistic regularities baked into a model's pretrained backbone, then judges sharing that backbone would share the bias regardless of how they were instruction-tuned, which would make same-author bias correlated *across* a model family rather than unique to one checkpoint. That would be a sharper failure mode than the surface biases, because it can't be averaged away by swapping in another evaluator from the same family.
The most interesting cross-domain angle the corpus offers is the escape hatch: ensembling across genuinely diverse sources denoises individual error. Models trained on many experts with different biases converge toward a consensus that beats any single one, because uncorrelated errors cancel [Can models trained on many imperfect experts outperform each one?]. The catch for same-author bias is that it's a *correlated* error: a panel of judges all from one family would reinforce, not cancel, their shared self-preference. So the lesson the corpus does support is that the defense against authorship bias isn't better single judges, it's judges whose training lineages actually differ.
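The cancel-versus-reinforce point can be sketched with the standard formula for the variance of an average of equally correlated errors: with n judges, per-judge error variance σ², and pairwise correlation ρ, the averaged error has variance σ²(1 + (n − 1)ρ)/n. The function name and the numbers below are illustrative assumptions, not measurements from the corpus.

```python
def panel_error_variance(n_judges: int, sigma2: float, rho: float) -> float:
    """Variance of the averaged score error for n equally correlated judges.

    sigma2 is each judge's error variance; rho is the pairwise correlation
    between any two judges' errors (0.0 means fully independent).
    """
    return sigma2 * (1 + (n_judges - 1) * rho) / n_judges


# Ten judges with independent errors: variance shrinks as 1/n.
independent_panel = panel_error_variance(10, 1.0, 0.0)

# Ten judges from one family, errors correlated at 0.5 (illustrative value).
same_family_panel = panel_error_variance(10, 1.0, 0.5)

# Adding more same-family judges barely helps: the variance approaches rho * sigma2.
large_same_family_panel = panel_error_variance(1000, 1.0, 0.5)
```

Under these assumed numbers the independent panel cuts error variance to a tenth of a single judge's, while the same-family panel stays above half no matter how many judges are added. The shared component sets a floor that only judges with different lineages can lower.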
Sources (6 notes)
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.
Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.
Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.