Do language models inherit gender bias from training data in grading tasks?

This note asks whether LLMs acting as graders or evaluators carry forward social biases, such as gender bias, absorbed during training. The corpus has no paper on gender in grading specifically, but it has a good deal on where evaluator bias comes from and how it surfaces in judging tasks.

This reads the question as two linked claims: that grading models can be biased, and that the bias traces back to training data rather than to how the model is later tuned for the grading job. Let me be straight up front: nothing in this collection studies gender bias in grading head-on. But several notes converge on the more general mechanism, and together they make a strong indirect case for 'yes, and the root is earlier than you'd think.'

The load-bearing finding is that biases are *planted in pretraining, not finetuning*. A causal experiment varying random seeds and cross-tuning models found that models sharing a pretrained backbone show similar bias patterns regardless of the instruction data used to finetune them; finetuning only nudges biases that pretraining already installed (Where do cognitive biases in language models come from?). So if a grading model inherits a skew from its corpus, you can't reliably fine-tune it away: the bias lives below the layer you're adjusting. A related note suggests prompting does not rescue the situation either: when a model's parametric priors are strong, in-context instructions (like 'grade this fairly, ignore the author') often fail to override them, and only direct intervention in the representations works (Why do language models ignore information in their context?).

The sharpest material is on LLMs used *as judges*, which is exactly the grading scenario. One study catalogs four exploitable biases in LLM judges, two of which, authority bias and beauty bias, are 'semantics-agnostic': the judge rewards fake references and rich formatting regardless of actual content (Can LLM judges be fooled by fake credentials and formatting?). If a judge can be swayed by superficial signals it learned to associate with quality, it's a short conceptual step, though one this corpus does not test directly, to it being swayed by demographic signals it learned to associate with quality. Another note shows models defaulting to a *conservative bias*, appearing to reason while actually falling back on a default option, which means a grader's apparent fairness may be a default, not a judgment (Are models actually reasoning about constraints or just defaulting conservatively?).

The most transferable insight comes from two calibration studies. LLMs overestimate how often irony appears because ironic examples are *more salient* in training text than in real use (Do language models overestimate how often irony appears?), and models perform worse on historical legal cases because recent cases are over-represented in the corpus (Why do language models struggle with historical legal cases?). Both are instances of the same mechanism that would produce gender bias in grading: whatever associations are frequent or salient in training data become the model's default expectations, and those expectations leak into any evaluative task. Add the 'artificial hivemind' finding, that different models converge on near-identical outputs because they share overlapping training data and alignment procedures (Do different AI models actually produce diverse outputs?), and you get a worrying corollary: swapping in a different grader model may not cancel the bias, because the models draw on much of the same training data.

The less obvious takeaway is that the fix points are not where you'd expect. Because the bias is a pretraining prior, prompt-level fairness instructions are weak, and because grading is a judgment task, the bias often hides behind plausible-looking reasoning rather than announcing itself. The corpus suggests the real levers are representation-level intervention and consistency training that forces identical outputs across irrelevant input changes (Can models learn to ignore irrelevant prompt changes?). Teaching a grader to score the same essay identically whether the author reads as male or female would be the same shape of solution.
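The "same essay, different apparent author" idea can be made concrete as a counterfactual consistency check. What follows is a minimal sketch, not a method taken from any of the notes: `counterfactual_gap`, the prompt template, the example names, and both toy graders are hypothetical stand-ins. In practice `grade` would wrap a call to whatever grading model is being audited.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical prompt layout: an author line, a blank line, then the essay.
TEMPLATE = "Author: {name}\n\n{essay}"


def counterfactual_gap(
    grade: Callable[[str], float],
    essays: Sequence[str],
    name_a: str,
    name_b: str,
) -> float:
    """Mean of score(name_a) - score(name_b) over identical essays.

    A grader that ignores the author line returns 0.0; a nonzero value
    means the score moved when only the apparent author changed.
    """
    gaps = []
    for essay in essays:
        score_a = grade(TEMPLATE.format(name=name_a, essay=essay))
        score_b = grade(TEMPLATE.format(name=name_b, essay=essay))
        gaps.append(score_a - score_b)
    return mean(gaps)


def length_grader(prompt: str) -> float:
    """Toy stand-in grader: scores the essay body by word count only."""
    body = prompt.split("\n\n", 1)[1]
    return float(len(body.split()))


def skewed_grader(prompt: str) -> float:
    """Toy stand-in grader with a deliberately planted one-point name bonus."""
    bonus = 1.0 if prompt.startswith("Author: James") else 0.0
    return length_grader(prompt) + bonus


essays = [
    "The treaty ended the war.",
    "Photosynthesis converts light into chemical energy.",
]
fair_gap = counterfactual_gap(length_grader, essays, "James", "Emily")
skew_gap = counterfactual_gap(skewed_grader, essays, "James", "Emily")
```

Here `fair_gap` is zero because the first toy grader never reads the author line, while `skew_gap` recovers the planted bonus. The same paired prompts are the kind of input a consistency-training setup would need, with the grader's output on one variant serving as the target for the other, though that training step is outside this sketch.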

Sources 8 notes

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do language models overestimate how often irony appears?

GPT-4o assigns significantly higher irony scores than humans (p < .001), revealing that LLMs detect irony as a pattern but miscalibrate its prevalence because ironic examples are more salient in training data than in actual use.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than on modern ones. The root cause is training-corpus over-representation of recent cases, which creates shallower representations of older precedent.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
