Can data filtering during pretraining prevent cognitive biases in language models?

This explores whether cleaning the training data before pretraining can stop language models from absorbing cognitive biases — and the corpus suggests filtering is a weaker lever than it sounds, because biases enter through statistical signatures rather than spottable content.

The most direct answer in the corpus is sobering: biases are largely a *pretraining* phenomenon, not something you patch later. A causal experiment that varied random seeds and cross-tuned models found that models sharing a pretrained backbone show similar bias patterns no matter what instruction data they are finetuned on; biases are planted during pretraining and only nudged afterward (see "Where do cognitive biases in language models come from?"). So the *timing* instinct behind the question is right: if you want to intervene, pretraining is where the leverage is. The harder question is whether filtering the data is a strong enough intervention.

Here the corpus turns skeptical. The most striking finding is that behavioral traits can pass between models through data that has *no semantic relationship to the trait at all*, and the effect persists even after rigorous filtering, because what is being transmitted is a statistical signature rather than readable content you could catch and remove (see "Can language models transmit hidden behavioral traits through unrelated data?"). The effect is model-specific and fails across different architectures, but where it does apply, a trait can survive filtering precisely because it doesn't live in anything a filter can see. Filtering-as-prevention therefore has a structural ceiling: you can scrub the obvious and still ship the bias.

Two adjacent results sharpen why. First, biases imprint fast and predictably: post-learning keyword priming can be forecast from a keyword's pre-learning probability, with a threshold of roughly 10^-3 separating contexts where priming occurs from those where it doesn't, and as few as three training exposures are enough to establish the effect (see "Can we predict keyword priming before learning happens?"). So a bias doesn't need to be common in the corpus to take hold; light contamination clears the bar. Second, once a prior is baked in, it dominates: models generate outputs that contradict their own context because parametric knowledge from training overrides in-context information, and prompting alone can't fix it. Overriding a strong prior requires causal intervention in the representations (see "Why do language models ignore information in their context?"). Together these say a bias is both easy to plant and hard to talk a model out of afterward.
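The threshold check described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the function name `split_by_threshold`, the example keywords, and their probabilities are all hypothetical. The note does not say which side of the ~10^-3 line is the priming-prone one, so the sketch only partitions the keywords and leaves that interpretation to the source paper.

```python
# Illustrative only: keyword names and probabilities below are made up.
PRIMING_THRESHOLD = 1e-3  # approximate threshold reported in the note


def split_by_threshold(keyword_probs, threshold=PRIMING_THRESHOLD):
    """Partition keywords by pre-learning probability relative to the threshold.

    Returns (below, at_or_above) as two sorted lists of keyword names.
    Which group is priming-prone is deliberately not asserted here.
    """
    below = sorted(k for k, p in keyword_probs.items() if p < threshold)
    at_or_above = sorted(k for k, p in keyword_probs.items() if p >= threshold)
    return below, at_or_above


# Hypothetical pre-learning probabilities for three keywords.
example_probs = {"vermilion": 2e-5, "blue": 4e-2, "ochre": 9e-4}
below, at_or_above = split_by_threshold(example_probs)
```

In practice the probabilities would come from the model being studied, measured before it sees the new training text; here they are hard-coded to keep the sketch self-contained.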

There's also a deeper trap worth naming: the dream of bias-free data assumes you can identify bias as a property of the dataset. The 'theory-free AI' critique argues this is a fallacy: models that look clean and accurate can launder bias through correlation-as-causation, where high accuracy metrics mask the harm rather than reveal it (see "Can AI models be truly free from human bias?"). And biases that look like reasoning are even slipperier: most models tested (twelve of fourteen) lean on a conservative default rather than actually evaluating constraints, and perform *worse* when the constraint is removed, a bias hiding behind apparent competence (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Filtering can't remove what you can't recognize as bias in the first place.
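The arithmetic behind the accuracy point is worth making explicit. The sketch below assumes a hypothetical caseload of 100,000 decisions; the source note cites the 95% accuracy figure but no caseload. Counting every error as a wrong decision is a simplification, and `expected_errors` is an illustrative name.

```python
def expected_errors(num_cases, accuracy):
    """Number of wrong decisions implied by a given accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # round() absorbs floating-point noise in (1 - accuracy).
    return round(num_cases * (1.0 - accuracy))


# Hypothetical caseload; only the 95% accuracy comes from the cited note.
wrong = expected_errors(100_000, 0.95)
```

At that assumed caseload a 5% error rate means 5,000 wrong decisions, which is how a headline accuracy number can sit alongside harm at scale.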

The takeaway: pretraining filtering is necessary but not sufficient. Pretraining is the right stage to act on, but the corpus points toward interventions that work *on the model's internals*, meaning causal edits to representations rather than curation of inputs, as the lever that actually moves entrenched bias. Filtering catches the biases that look like content; the ones that travel as statistics walk right through.

Sources 6 notes

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can AI models be truly free from human bias?

Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a bias-mitigation researcher. The question remains open: *Can interventions during pretraining—filtering, architectural choices, or representation-level edits—durably prevent or reduce cognitive biases that would otherwise persist through finetuning and deployment?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints to re-test:
- Biases are *planted* during pretraining and only nudged by finetuning; instruction-level mitigation fails because the bias lives in the backbone (2025-07, arXiv:2507.07186).
- Behavioral traits transmit via *semantically unrelated data*—no surface signal a filter can catch—so data curation alone has a structural ceiling (2025-07, arXiv:2507.14805).
- Light contamination suffices: as few as three exposures above a ~10^-3 probability threshold establish a bias; biases don't need to be frequent to take hold (2025-07, arXiv:2507.20252).
- Post-training (RL, consistency training) can amplify or redirect pretraining biases but cannot erase them; representation-level causal edits are needed (2025-04, arXiv:2504.07912; 2025-10, arXiv:2510.27062).
- Models exploit conservative heuristics masquerading as reasoning; high accuracy metrics mask bias-laundering through spurious correlation (2024-11, arXiv:2411.18656; 2026-03, arXiv:2603.29025).

Anchor papers (verify; mind their dates):
- arXiv:2507.07186 (2025-07): Planted in Pretraining, Swayed by Finetuning — causal evidence that pretraining is the bias origin.
- arXiv:2507.14805 (2025-07): Subliminal Learning — biases travel through hidden statistical signatures, not detectable content.
- arXiv:2504.07912 (2025-04): Echo Chamber — RL amplifies pretraining biases rather than correcting them.
- arXiv:2411.18656 (2024-11): The Return of Pseudosciences — theory-free AI launders bias via high accuracy.

Your task:
(1) RE-TEST EACH CONSTRAINT. Since late 2026, have new *pretraining techniques* (e.g., constitutional AI during pretraining, token-level gating, adversarial data augmentation), *model architectures* (sparse, modular, or interpretable-by-design variants), or *representation interventions* (layer-wise alignment, probe-based debiasing) relaxed the ceiling on data filtering? Does the claim that biases survive semantic filtering still hold, or has mechanistic understanding of *how* hidden signals propagate enabled targeted removal? Separate the durable insight (pretraining is the critical stage) from constraints that may have dissolved.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Are there papers showing that *upstream* filtering + *architectural design* together *do* achieve durable bias reduction, or that representation-level edits at pretraining time outperform post-hoc interventions?

(3) Propose 2 research questions that assume the regime may have moved:
   - Can mechanistic probes of bias-carrying subspaces, combined with selective data ablation during pretraining, close the gap that semantic filtering left open?
   - Does continual pretraining on debiased corpora (vs. one-shot filtering) offer tighter control over bias persistence than static filtering did?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
