Where do cognitive biases in language models come from?
Do LLM biases originate during pretraining or finetuning? Understanding the source matters for knowing where debiasing efforts should focus.
Prior work established that LLMs exhibit systematic cognitive biases analogous to those studied in humans — anchoring, availability, base-rate neglect, confirmation bias, and over 30 others — and that these biases vary across models and are often amplified by instruction tuning. What remained unclear: do these differences originate in pretraining, in finetuning, or in random noise from training stochasticity? The question matters because the answer determines where debiasing efforts should be directed and what to expect from new models.
The paper "Planted in Pretraining, Swayed by Finetuning" answers this with a two-step causal experiment. First, it finetunes each model several times with different random seeds and measures how much training randomness alone moves bias scores across more than 30 cognitive biases. Second, it introduces cross-tuning: swapping instruction datasets between models to isolate the source of each bias pattern. If biases were driven primarily by finetuning data, swapping datasets should swap the bias patterns. If they were driven primarily by the pretrained backbone, swapping datasets should leave the patterns largely intact.
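The two steps above can be sketched as a small experimental grid. This is a minimal illustration, not the paper's code: the function names `cross_tuning_grid` and `seed_variability`, the dictionary layout, and the use of a population standard deviation as the measure of seed noise are all assumptions made here for clarity.

```python
from itertools import product
from statistics import mean, pstdev


def cross_tuning_grid(backbones, datasets, seeds):
    """Enumerate every (backbone, dataset, seed) finetuning run in the design.

    Crossing every backbone with every instruction dataset is what lets the
    two candidate sources of bias be separated afterwards.
    (Hypothetical helper, not taken from the paper.)
    """
    return [
        {"backbone": b, "dataset": d, "seed": s}
        for b, d, s in product(backbones, datasets, seeds)
    ]


def seed_variability(scores_by_seed):
    """Per-bias (mean, standard deviation) across seeds for one grid cell.

    scores_by_seed: list of equal-length lists, one bias-score vector per seed,
    all from the same (backbone, dataset) pair. The spread returned here is
    the variation attributable to training randomness alone.
    """
    per_bias = zip(*scores_by_seed)  # transpose: one tuple of scores per bias
    return [(mean(vals), pstdev(vals)) for vals in per_bias]
```

With two backbones, two datasets, and three seeds, `cross_tuning_grid` yields twelve runs. The per-bias spread from `seed_variability` gives a noise floor against which differences between backbones or datasets can be judged.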
The results: while training randomness introduces some variability, biases are mainly shaped by pretraining. Models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. The finetuning dataset modulates existing tendencies but does not create them. Cognitive biases are planted at pretraining and only swayed afterward.
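One way to make the "same backbone versus same finetuning data" comparison concrete is to compare bias-score vectors pairwise. The sketch below assumes each cross-tuned model is summarized by one vector of bias scores and uses cosine similarity as the comparison. Both the metric and the name `grouped_similarity` are illustrative assumptions, not necessarily the analysis the paper ran.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grouped_similarity(bias_vectors):
    """Average similarity for model pairs sharing a backbone vs. sharing a dataset.

    bias_vectors: dict mapping (backbone, dataset) -> list of bias scores.
    Pairs that share both or neither factor are ignored, so each average
    isolates exactly one shared factor.
    """
    same_backbone, same_dataset = [], []
    for (k1, v1), (k2, v2) in combinations(bias_vectors.items(), 2):
        (b1, d1), (b2, d2) = k1, k2
        if b1 == b2 and d1 != d2:
            same_backbone.append(cosine(v1, v2))
        elif d1 == d2 and b1 != b2:
            same_dataset.append(cosine(v1, v2))

    def avg(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return {"same_backbone": avg(same_backbone), "same_dataset": avg(same_dataset)}
```

Under the pretraining-dominant reading, `same_backbone` would come out higher than `same_dataset`. Under a finetuning-dominant reading, the ordering would flip.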
This extends a broader pattern in the vault. "Do base models already contain hidden reasoning ability?" establishes that reasoning capability is pretraining-determined; RL and finetuning surface what the base model already contains. "Does RLVR actually expand what models can reason about?" and "Why does RLVR work with completely random rewards?" extend this to RLVR: the reward signal matters less than the pretraining it activates. Now the same pattern applies to cognitive biases: pretraining sets them, finetuning modulates them. Across reasoning, RLVR, and bias, the finding is the same: post-training is a lever on pretraining, not a source of new structure.
The practical implication is uncomfortable. "Do personas make language models reason like biased humans?" already documented that prompt-based debiasing fails. This paper explains why: the biases sit deeper than the surface at which prompts operate. If cognitive biases are pretraining-deep, then finetuning interventions targeting specific biases will mostly fail: they will dampen the surface expression but leave the underlying structure intact. The bias will reappear under any prompt condition that bypasses the finetuned dampening, which is most conditions. Real debiasing would require intervening at pretraining, by filtering training data for the biases present in human-written text, and there is no current mechanism for doing that at scale.
This also reframes what "How much poisoned training data survives safety alignment?" is measuring. Pretraining poisoning persists not because of the specific data but because pretraining depth is where behavioral tendencies live. The finetuning-as-cleanup intuition, that alignment training can scrub problems out of pretrained models, is structurally wrong in the same way that finetuning-as-debias is wrong. Both treat post-training as capable of rewriting what pretraining installed. It isn't.
Inquiring lines that use this note as a source (67)
This note is a source for these synthesized inquiries.
- Can dataset-level debiasing methods fix popularity bias inherited from pretraining?
- How do LLM biases manifest differently across the three paradigms?
- How does prompt iteration reinforce user bias without empirical anchoring?
- Can prompt-based debiasing overcome entrenched LLM model priors?
- How do different LLM integration paradigms affect inheritance of pretraining biases?
- How do pretraining biases interact differently with prompts across model tiers?
- How do position bias and popularity bias interact with sequence order blindness?
- Can prompting strategies eliminate systematic biases without shuffling or aggregation?
- How does pretraining corpus popularity bias affect LLM recommendation behavior?
- How do LLM biases reflect social classification schemas rather than random errors?
- What calibration corrections can reduce LLM judge bias in automated evaluation pipelines?
- How does same-author bias interact with the four adversarial judge biases already documented?
- Does RLHF politeness bias manifest as sycophancy in other LLM tasks?
- Do language models inherit gender bias from training data in grading tasks?
- Do language models exhibit the same causal biases that humans show?
- Can counterfactual invariance techniques address exploitable biases in LLM judges?
- How do citizen assembly preferences reduce LLM political bias?
- Why do review corpora contain biases that affect generated comparisons?
- How do early layers preserve unbiased information while late layers conform?
- What circuit mechanisms produce belief bias in syllogistic reasoning?
- How does demo position create spatial bias in prompts?
- How does rhetorical familiarity bias models toward their own arguments?
- What role does attention structure play in creating position bias?
- Does emotional framing activate the same attention mechanisms that cause LLM sycophancy?
- How does tone sensitivity create systematic informational bias in model responses?
- Why does hypothesis attestation bias exist separately from frequency bias in NLI?
- Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
- Does fine-tuning on NLI tasks amplify or reduce frequency bias in language models?
- Why does optimism bias disappear when LLMs passively observe outcomes?
- How does this motivational bias connect to LLMs' causal reasoning failures?
- Do language models show the same truth bias as humans?
- Why do primacy effects peak at specific instruction densities?
- Why does keyword priming require only three training exposures to establish?
- Does keyword priming explain why pre-training poisoning persists through alignment?
- Can priming from different facts interfere with each other in the same model?
- What mechanism makes keyword probability the strongest predictor of priming?
- Why do transformer attention patterns show positional and sequential bias across tasks?
- Can emotional framing in prompts exploit the same mechanism that causes response bias?
- Does fine-tuning on NLI tasks reduce or amplify frequency bias?
- Why do LLMs show gender bias but human evaluators do not?
- Do external perspectives fix the self-evaluation bias in language models?
- Why do LLMs inherit causal biases from their training data?
- Why do LLM judges show more extreme sycophancy bias than humans?
- How do preference models amplify human cognitive biases into systematic miscalibration?
- Does removing cognitive bias from training signals accidentally break what makes alignment work?
- What role does inductive bias play versus model capacity in practice?
- Why do self-consistency methods fail where pretraining bias is strongest?
- Can humans suppress frequency bias through attention and intention?
- Does debiasing training data actually solve the bias problem in machine learning?
- Does attention bias explain grounding failure in language models?
- What other evaluation biases exist in LLM judge systems?
- How much noise comes from rater idiosyncrasy versus selection bias?
- Does representational density emerge from training data exposure during pretraining?
- Can implicit association tests reveal LLM biases beneath trained responses?
- What are the consequences of stacked accommodation biases in LLM predictions?
- Can prompt-based debiasing work if biases are embedded in pretraining?
- Can data filtering during pretraining prevent cognitive biases in language models?
- Do newer LLM generations create worse detector bias through increased linguistic divergence?
- How does transformer attention bias toward repeated and context-prominent content?
- What structural biases does transformer attention have before training?
- What biases do single large LLM judges introduce into comparisons?
- Why do unified models still inherit data-distribution biases from training?
- Does alignment compound cultural bias that started during pretraining?
- How does typicality bias in human annotation affect downstream model behavior?
- What biases might an LLM judge introduce into an on-policy alignment process?
- Does latent density emerge during pretraining from training data familiarity?
- How does Western-dominance bias propagate through multimodal training data?
Related concepts in this collection (11)
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  the parent pattern: pretraining sets capabilities; finetuning activates them
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  same pattern for RL vs pretraining
- Why does RLVR work with completely random rewards?
  RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
  reward signal matters less than the pretraining it surfaces
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  timing not capability; analog at the RL level
- How much poisoned training data survives safety alignment?
  Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
  pretraining-persistence applies to poisons and biases alike
- Do personas make language models reason like biased humans?
  When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit? And can standard debiasing techniques counteract these effects?
  prompt debiasing fails because biases are pretraining-deep
- Do LLMs generalize moral reasoning by meaning or surface form?
  When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.
  moral-reasoning surface patterns may be pretraining-encoded
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  activation-level tracking as partial mitigation for pretraining-deep tendencies
- Do pretraining and fine-tuning scale independently in language models?
  Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
  complementary decoupling at the capability level
- Does fine-tuning on NLI teach inference or amplify shortcuts?
  When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.
  concrete instance: finetuning amplifies pretraining-baked frequency bias rather than teaching inference; mirrors the structure of this finding at the NLI-specific level
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  training objectives modulate direction of pretraining-deep tendencies in opposite ways without creating new structure; consistent with the bias-modulation-not-creation thesis
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Language models show human-like content effects on reasoning tasks
- On the Reasoning Capacity of AI Models and How to Quantify It
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Premise Order Matters in Reasoning with Large Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
Original note title
cognitive biases in LLMs are mainly shaped by pretraining not finetuning — models sharing a pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data