Where do cognitive biases in language models come from?

Do LLM biases originate during pretraining or finetuning? Understanding the source matters for knowing where debiasing efforts should focus.

Synthesis note · 2026-04-07 · sourced from Flaws

Prior work established that LLMs exhibit systematic cognitive biases analogous to those studied in humans (anchoring, availability, base-rate neglect, confirmation bias, and others, more than 30 in all) and that these biases vary across models and are often amplified by instruction tuning. What remained unclear: do these differences originate in pretraining, in finetuning, or in random noise from training stochasticity? The question matters because the answer determines where debiasing efforts should be directed and what to expect from new models.

The paper "Planted in Pretraining, Swayed by Finetuning" answers this with a two-step causal experimental approach. First, it finetunes models multiple times with different random seeds, measuring how training randomness alone affects bias scores across more than 30 cognitive biases. Second, it introduces cross-tuning: swapping instruction datasets between models to isolate the source of each bias pattern. If biases were driven primarily by finetuning data, swapping datasets should swap the bias patterns. If they were driven primarily by the pretrained backbone, swapping datasets should leave the bias patterns largely intact.
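The two-step design described above can be laid out as a small grid of finetuning runs. The following is a minimal sketch, not the paper's code: the backbone names, dataset names, seed count, and both helper functions are hypothetical placeholders chosen for illustration.

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical labels: this note does not name the models or datasets.
BACKBONES = ["backbone_A", "backbone_B"]
DATASETS = ["instr_X", "instr_Y"]
SEEDS = [0, 1, 2]


def cross_tuning_grid(backbones, datasets, seeds):
    """List every (backbone, dataset, seed) finetuning run in the design.

    Pairing each backbone with each dataset is the cross-tuning step;
    repeating each pairing over several seeds is the randomness step.
    """
    return [
        {"backbone": b, "dataset": d, "seed": s}
        for b, d, s in product(backbones, datasets, seeds)
    ]


def seed_spread(scores_by_seed):
    """Mean and population standard deviation of one bias score across seeds."""
    values = list(scores_by_seed.values())
    return mean(values), pstdev(values)


grid = cross_tuning_grid(BACKBONES, DATASETS, SEEDS)
print(len(grid))  # 2 backbones x 2 datasets x 3 seeds = 12 runs
```

A small spread from `seed_spread` relative to the differences between backbones would indicate that training randomness alone does not account for the bias patterns.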

The results: while training randomness introduces some variability, biases are mainly shaped by pretraining. Models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. The finetuning dataset modulates existing tendencies but does not create them. Cognitive biases are planted at pretraining and only swayed afterward.
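One way to make the "same backbone versus same finetuning data" comparison concrete is to treat each model's bias scores as a vector and compare pairwise similarities. This is a sketch under stated assumptions: cosine similarity is a convenient stand-in rather than necessarily the paper's metric, and the scores below are synthetic values constructed to show the pattern, not measured results.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def average(xs):
    return sum(xs) / len(xs)


def grouped_similarity(vectors):
    """vectors maps (backbone, dataset) to a list of bias scores.

    Returns (mean similarity of pairs sharing only the backbone,
             mean similarity of pairs sharing only the dataset).
    """
    same_backbone, same_dataset = [], []
    for (k1, v1), (k2, v2) in combinations(vectors.items(), 2):
        sim = cosine(v1, v2)
        if k1[0] == k2[0] and k1[1] != k2[1]:
            same_backbone.append(sim)
        elif k1[1] == k2[1] and k1[0] != k2[0]:
            same_dataset.append(sim)
    return average(same_backbone), average(same_dataset)


# Synthetic toy scores (three made-up biases per model), illustration only.
toy = {
    ("backbone_A", "instr_X"): [0.8, 0.1, -0.3],
    ("backbone_A", "instr_Y"): [0.7, 0.2, -0.2],
    ("backbone_B", "instr_X"): [-0.1, 0.6, 0.5],
    ("backbone_B", "instr_Y"): [-0.2, 0.5, 0.6],
}
by_backbone, by_dataset = grouped_similarity(toy)
print(by_backbone > by_dataset)
```

With real bias scores, `by_backbone` exceeding `by_dataset` would correspond to the finding reported here; the toy data only demonstrates how the comparison is computed.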

This extends a broader pattern in the vault. "Do base models already contain hidden reasoning ability?" establishes that reasoning capability is pretraining-determined; RL and finetuning surface what the base model already contains. "Does RLVR actually expand what models can reason about?" and "Why does RLVR work with completely random rewards?" extend this to RLVR: the reward signal matters less than the pretraining it activates. Now the same pattern applies to cognitive biases: pretraining sets them, finetuning modulates them. Across reasoning, RLVR, and bias, the finding is the same: post-training is a lever on pretraining, not a source of new structure.

The practical implication is uncomfortable. "Do personas make language models reason like biased humans?" already documented that prompt-based debiasing fails. This paper explains why: the biases sit deeper than the surface at which prompts operate. If cognitive biases are pretraining-deep, then finetuning interventions targeting specific biases will mostly fail: they will dampen the surface expression but leave the underlying structure intact. The bias will reappear under any prompt condition that bypasses the finetuned dampening, which is most conditions. Real debiasing would require intervening at pretraining, that is, filtering training data for the biases present in human-written text, and there is no current mechanism for doing that at scale.

This also reframes what "How much poisoned training data survives safety alignment?" is measuring. Pretraining poisoning persists not because of the specific data but because pretraining depth is where behavioral tendencies live. The finetuning-as-cleanup intuition (that alignment training can scrub problems out of pretrained models) is structurally wrong in the same way that finetuning-as-debiasing is wrong. Both treat post-training as capable of rewriting what pretraining installed. It isn't.

Related concepts in this collection 11

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
the parent pattern: pretraining sets capabilities; finetuning activates them
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
same pattern for RL vs pretraining
Why does RLVR work with completely random rewards? RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
reward signal matters less than the pretraining it surfaces
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
timing not capability; analog at the RL level
How much poisoned training data survives safety alignment? Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
pretraining-persistence applies to poisons and biases alike
Do personas make language models reason like biased humans? When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit? And can standard debiasing techniques counteract these effects?
prompt debiasing fails because biases are pretraining-deep
Do LLMs generalize moral reasoning by meaning or surface form? When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.
moral-reasoning surface patterns may be pretraining-encoded
Can we track and steer personality shifts during model finetuning? This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
activation-level tracking as partial mitigation for pretraining-deep tendencies
Do pretraining and fine-tuning scale independently in language models? Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
complementary decoupling at the capability level
Does fine-tuning on NLI teach inference or amplify shortcuts? When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.
concrete instance: finetuning amplifies pretraining-baked frequency bias rather than teaching inference; mirrors the structure of this finding at the NLI-specific level
Does training objective determine which direction models fail at abstention? Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
training objectives modulate direction of pretraining-deep tendencies in opposite ways without creating new structure; consistent with the bias-modulation-not-creation thesis

Original note title

cognitive biases in LLMs are mainly shaped by pretraining not finetuning — models sharing a pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data
