SYNTHESIS NOTE

Where do cognitive biases in language models come from?

Do LLM biases originate during pretraining or finetuning? Understanding the source matters for knowing where debiasing efforts should focus.

Synthesis note · 2026-04-07 · sourced from Flaws

Prior work established that LLMs exhibit systematic cognitive biases analogous to those studied in humans — anchoring, availability, base-rate neglect, confirmation bias, and over 30 others — and that these biases vary across models and are often amplified by instruction tuning. What remained unclear: do these differences originate in pretraining, in finetuning, or in random noise from training stochasticity? The question matters because the answer determines where debiasing efforts should be directed and what to expect from new models.

The paper "Planted in Pretraining, Swayed by Finetuning" answers this with a two-step causal experimental approach. First, it finetunes models multiple times using different random seeds, measuring how training randomness alone affects bias scores across more than 30 cognitive biases. Second, it introduces cross-tuning: swapping instruction datasets between models to isolate bias sources. If biases were primarily driven by finetuning data, then swapping datasets between models should swap the bias patterns. If biases were primarily driven by the pretrained backbone, then swapping datasets should leave bias patterns largely intact.
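The two-step design can be pictured as a grid of finetuning runs. The sketch below is illustrative only: the backbone names, dataset names, and seed count are hypothetical placeholders (the note does not specify them), and `shared_factor` is a helper invented here to label which comparison a pair of runs supports.

```python
from itertools import product

# Hypothetical placeholders: the note does not name the models, datasets, or seed count.
BACKBONES = ["backbone_A", "backbone_B"]
INSTRUCTION_SETS = ["dataset_X", "dataset_Y"]
SEEDS = [0, 1, 2]


def cross_tuning_runs(backbones, datasets, seeds):
    """Enumerate every (backbone, dataset, seed) finetuning run.

    Pairing each backbone with every dataset, including the one originally
    used for the other model, is the swap that cross-tuning relies on.
    """
    return [
        {"backbone": b, "dataset": d, "seed": s}
        for b, d, s in product(backbones, datasets, seeds)
    ]


def shared_factor(run_a, run_b):
    """Label what two runs have in common, i.e. which comparison they support."""
    same_backbone = run_a["backbone"] == run_b["backbone"]
    same_dataset = run_a["dataset"] == run_b["dataset"]
    if same_backbone and same_dataset:
        return "seed_only_differs"  # step 1: measures training randomness
    if same_backbone:
        return "same_backbone"      # step 2: dataset was swapped
    if same_dataset:
        return "same_dataset"       # step 2: backbone was swapped
    return "nothing_shared"


runs = cross_tuning_runs(BACKBONES, INSTRUCTION_SETS, SEEDS)
```

Pairs labeled `seed_only_differs` give the noise floor from step one. Comparing `same_backbone` pairs against `same_dataset` pairs is the step-two contrast that separates the two hypotheses.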

The results: while training randomness introduces some variability, biases are mainly shaped by pretraining. Models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. The finetuning dataset modulates existing tendencies but does not create them. Cognitive biases are planted at pretraining and only swayed afterward.
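One way to make the "more similar bias patterns" comparison concrete is sketched below. Everything in it is an assumption for illustration: the bias-score vectors are invented toy numbers (not results from the paper), the condition names are hypothetical, and cosine similarity is just one plausible choice of metric, since the note does not say which measure the paper uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy data: one score per bias for each (backbone, dataset) condition.
# These numbers are NOT from the paper; they only illustrate the pattern.
scores = {
    ("backbone_A", "dataset_X"): [0.8, 0.1, 0.5, 0.3],
    ("backbone_A", "dataset_Y"): [0.7, 0.2, 0.6, 0.3],
    ("backbone_B", "dataset_X"): [0.2, 0.7, 0.1, 0.6],
    ("backbone_B", "dataset_Y"): [0.3, 0.8, 0.2, 0.5],
}


def mean_similarity(scores, shared_index):
    """Mean cosine similarity over condition pairs sharing exactly one factor.

    shared_index=0 selects pairs with the same backbone but different datasets;
    shared_index=1 selects pairs with the same dataset but different backbones.
    """
    other_index = 1 - shared_index
    sims = [
        cosine(scores[p], scores[q])
        for p, q in combinations(scores, 2)
        if p[shared_index] == q[shared_index] and p[other_index] != q[other_index]
    ]
    return sum(sims) / len(sims)


same_backbone = mean_similarity(scores, shared_index=0)
same_dataset = mean_similarity(scores, shared_index=1)
print(f"same backbone: {same_backbone:.3f}, same dataset: {same_dataset:.3f}")
```

With these toy vectors the same-backbone pairs come out more similar than the same-dataset pairs, which is the shape of result the note describes. Had finetuning data dominated, the ordering would be reversed.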

This extends a broader pattern in the vault. "Do base models already contain hidden reasoning ability?" establishes that reasoning capability is pretraining-determined; RL and finetuning surface what the base model already contains. "Does RLVR actually expand what models can reason about?" and "Why does RLVR work with completely random rewards?" extend this to RLVR: the reward signal matters less than the pretraining it activates. Now the same pattern applies to cognitive biases: pretraining sets them, finetuning modulates them. Across reasoning, RLVR, and bias, the finding is the same: post-training is a lever on pretraining, not a source of new structure.

The practical implication is uncomfortable. "Do personas make language models reason like biased humans?" already documented that prompt-based debiasing fails. This paper suggests why: the biases sit deeper than the surface at which prompts operate. If cognitive biases are pretraining-deep, then finetuning interventions targeting specific biases will mostly fail: they will dampen the surface expression but leave the underlying structure intact. The bias can then reappear under any prompt condition that bypasses the finetuned dampening, which is most conditions. Real debiasing would require intervening at pretraining, by filtering training data for the biases present in human-written text, and there is no current mechanism for doing that at scale.

This also reframes what "How much poisoned training data survives safety alignment?" is measuring. Pretraining poisoning persists not because of the specific data but because pretraining depth is where behavioral tendencies live. The finetuning-as-cleanup intuition, that alignment training can scrub problems out of pretrained models, is structurally wrong in the same way that finetuning-as-debias is wrong. Both treat post-training as capable of rewriting what pretraining installed. It isn't.

Original note title

cognitive biases in LLMs are mainly shaped by pretraining not finetuning — models sharing a pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data