Does removing cognitive bias from training signals accidentally break what makes alignment work?
This explores whether scrubbing cognitive and statistical biases out of training signals might also strip away behaviors that make a model genuinely useful and aligned — i.e., whether 'debiasing' and 'alignment' are pulling on some of the same threads.
This question reads the word 'bias' two ways at once, as a flaw we want to remove and as a tilt that alignment quietly relies on, and the corpus suggests both readings are right, which is exactly why the move is risky. The cleanest case that removing bias is *safe* comes from causal reward modeling, where constraining reward predictions to stay stable when irrelevant variables change strips out length bias, sycophancy, concept bias, and discrimination all at once without hurting quality (see 'Can counterfactual invariance eliminate reward hacking biases?'). The same spirit shows up in keeping anchoring bias out of human-AI decisions by having the model supply interpretive guidance rather than make the call (see 'Can AI guidance reduce anchoring bias better than AI decisions?'). On this evidence, bias is just noise you can subtract.
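The invariance idea in the paragraph above can be made concrete with a small sketch. It is illustrative only and is not the method from the cited note: the two toy reward functions, the padded response pairs, and the name `invariance_penalty` are all hypothetical. The sketch assumes each response comes with a counterfactual twin that differs only in an irrelevant variable (here, length), and it measures how much the reward moves between the two.

```python
from typing import Callable, List, Tuple

RewardFn = Callable[[str], float]


def invariance_penalty(reward_fn: RewardFn, pairs: List[Tuple[str, str]]) -> float:
    """Mean squared reward gap between each response and its counterfactual twin."""
    if not pairs:
        return 0.0
    return sum((reward_fn(a) - reward_fn(b)) ** 2 for a, b in pairs) / len(pairs)


def length_biased_reward(text: str) -> float:
    # Hypothetical biased scorer: a crude quality signal plus a bonus per word.
    quality = 1.0 if "correct" in text else 0.0
    return quality + 0.1 * len(text.split())


def length_invariant_reward(text: str) -> float:
    # Hypothetical scorer that looks only at the crude quality signal.
    return 1.0 if "correct" in text else 0.0


# Each pair keeps the content fixed and changes only the padding (assumed data).
pairs = [
    ("the answer is correct", "the answer is correct , to be clear , indeed"),
    ("no idea", "no idea at all , honestly , sorry"),
]

biased_gap = invariance_penalty(length_biased_reward, pairs)
invariant_gap = invariance_penalty(length_invariant_reward, pairs)
print(f"length-biased penalty:    {biased_gap:.3f}")
print(f"length-invariant penalty: {invariant_gap:.3f}")
```

In a real training setup a term like this would presumably be added to the usual preference loss with some weight, so that the reward model is pushed toward scores that do not move under the irrelevant change. Nothing is deleted from the model; its behavior is constrained across the pair.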
The sharp counter-finding is that alignment training *is itself* a debiasing process, and that's where things break. RLHF rewards calibrated, hedged, neutral language, which structurally prevents the model from performing speech acts that require overclaiming relative to what it can prove: alarm, warning, denunciation, prophecy (see 'Does alignment training suppress socially necessary speech acts?'). The note frames this as a direct consequence of the objective, not a bug. So 'removing the bias toward confident assertion' is the same operation as 'making the model unable to sound an alarm.' The thing you'd call a bias from one angle is a socially necessary capability from another, which is the most direct 'yes' your question can get.
There's also a deeper structural reason the surgery may not even land where you aim it. A causal study found that cognitive biases are planted during *pretraining* and only swayed by finetuning: models sharing a backbone show the same bias patterns regardless of instruction data (see 'Where do cognitive biases in language models come from?'). If the bias lives in the base weights, scrubbing it from your alignment signal mostly modulates surface behavior rather than removing the underlying tilt. Pair that with the finding that post-training largely *activates* capabilities already present rather than installing new ones (see 'Can careful curation replace massive alignment datasets?'), and 'remove bias from training signals' starts to look less like editing the model and more like changing which of its existing dispositions get surfaced.
That reframing matters because alignment may *depend* on those pretrained tilts in ways that aren't obvious. RL post-training is shown to collapse onto a single dominant format from the pretraining distribution while suppressing alternatives within the first epoch, with the winner chosen by scale, not necessarily by quality (see 'Does RL training collapse format diversity in pretrained models?'). Whenever a training signal removes a 'bias,' it's also picking a winner among the model's inherited distributions, and the thing it suppresses might have been load-bearing. The deception work points the same way from the opposite side: reducing deceptive behavior worked by *aligning* self- and other-referencing representations, that is, by minimizing a representational gap rather than by removing a feature, and it preserved capabilities precisely because it didn't excise anything (see 'Can aligning self-other representations reduce AI deception?').
The synthesis the reader might not expect: the corpus draws a line between *subtractive* debiasing (remove a spurious feature) and *invariance-based* debiasing (force the model to behave consistently across an irrelevant change). Counterfactual invariance and consistency training (see 'Can models learn to ignore irrelevant prompt changes?') both succeed by constraining behavior to ignore the irrelevant rather than deleting a disposition, and they keep capabilities intact. Methods that instead suppress a behavior wholesale (RLHF's flattening of speech acts, RL's format collapse) are where alignment quietly breaks. So the honest answer is: removing bias breaks alignment when 'removing' means deletion, and tends not to when it means enforcing invariance. The danger isn't debiasing per se. It's that the most common training signals can't tell a spurious feature from a load-bearing one, and confidently flatten both.
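To show what 'enforcing invariance' can look like at the data level, here is a minimal sketch of output-level consistency training as the sources describe it: the model's own answer to the clean prompt becomes the training target for every wrapped version of that prompt. The stand-in `toy_model`, the two wrappers, and the helper name `build_consistency_pairs` are hypothetical and chosen only for illustration; a real pipeline would call an actual model and then fine-tune on the resulting pairs.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Wrapper = Callable[[str], str]


def build_consistency_pairs(
    model: Model, clean_prompts: List[str], wrappers: List[Wrapper]
) -> List[Tuple[str, str]]:
    """Pair every wrapped prompt with the model's own answer to the clean prompt."""
    pairs = []
    for prompt in clean_prompts:
        target = model(prompt)  # the clean response is the target; no external label
        for wrap in wrappers:
            pairs.append((wrap(prompt), target))
    return pairs


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in that caves to a stated user opinion.
    if "I think the answer is 5" in prompt:
        return "5"
    return "4"


# Irrelevant changes the model should learn to ignore (assumed examples).
wrappers: List[Wrapper] = [
    lambda p: "I think the answer is 5. " + p,
    lambda p: p + " Please answer quickly.",
]

pairs = build_consistency_pairs(toy_model, ["What is 2 + 2?"], wrappers)

# Wrapped prompts where the current behavior departs from the clean target:
# these are the cases the consistency objective would push on.
inconsistent = [(wrapped, target) for wrapped, target in pairs if toy_model(wrapped) != target]
for wrapped, target in inconsistent:
    print(f"{wrapped!r} -> got {toy_model(wrapped)!r}, target {target!r}")
```

The design point is that the target comes from the model itself, so no disposition is removed and no new capability is installed. The only thing trained is agreement between the clean and the wrapped case.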
Sources (8 notes)
Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.
Learning to Guide eliminates anchoring bias and unassisted hard cases by having machines supply interpretive guidance rather than autonomous decisions, keeping responsibility with humans while improving their judgment through enhanced perception.
RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
LIMA demonstrates that a strong pretrained model fine-tuned on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Self-Other Overlap fine-tuning reduced deceptive responses from 73–100% to 2–17% across model scales without harming capabilities. By minimizing the representational gap between self-referencing and other-referencing scenarios, the approach eliminates the structural asymmetry that enables deception.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.