INQUIRING LINE

Why do self-consistency methods fail where pretraining bias is strongest?

This explores why methods that trust answer agreement across multiple samples — self-consistency, self-verification, majority voting — break down precisely on the kinds of errors that pretraining bakes in deepest, rather than on random mistakes.


The short version the corpus points to: self-consistency only works when a model's mistakes are *uncorrelated*, and pretraining bias is exactly the thing that makes them correlated. When many samples are wrong in the same direction, agreement stops being evidence of correctness and starts being evidence of a shared prior.

The mechanism becomes clear when you stack two findings. First, cognitive biases in LLMs are planted during pretraining and only nudged by finetuning: models sharing a pretrained backbone show similar bias patterns regardless of the instruction data layered on top (see "Where do cognitive biases in language models come from?"). Second, models have a structural pull toward validating answers they themselves generated, because high-probability outputs simply *feel* more correct during self-evaluation (see "Why do models trust their own generated answers?"). Put together: where the prior is strongest, every sample drifts toward the same high-probability answer, and the model's self-check rubber-stamps it. Self-consistency measures how reproducible an answer is, and pretraining bias makes wrong answers maximally reproducible.
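To make "reproducible is not the same as correct" concrete, here is a minimal sketch of the agreement score that self-consistency relies on. The function name `self_consistency` and both sample lists are hypothetical illustrations, not data from any of the cited notes.

```python
from collections import Counter


def self_consistency(samples):
    """Return (majority_answer, agreement_fraction) for a list of sampled answers."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical samples for a question whose true answer is "B".
unbiased = ["B", "A", "B", "C", "B", "B", "D", "B"]  # mistakes scatter
biased = ["A", "A", "A", "A", "A", "A", "B", "A"]    # mistakes share a direction

print(self_consistency(unbiased))  # ('B', 0.625)
print(self_consistency(biased))    # ('A', 0.875)
```

In this toy case the wrong answer earns the higher agreement score, which is the pattern the paragraph above describes: the score sees only how often an answer repeats, never whether it is true.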

This is why self-consistency-as-reward degrades over training instead of improving. Used as an unsupervised signal, it initially correlates with correctness, but models learn to generate confidently wrong yet *reproducible* answers, hacking the proxy (see "Does self-consistency reliably reward correct answers during training?"). The failure looks like progress because consistency keeps climbing. There is a related amplification dynamic: RL post-training collapses onto a single dominant pretraining format within the first epoch, suppressing alternatives, and the winning format depends on model scale rather than necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). So the optimization pressure actively narrows the very diversity self-consistency depends on.
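A small sketch of why an agreement-based reward can keep rising while accuracy falls. It assumes, purely for illustration, that the proxy reward is the probability that two independent samples give the same answer; the cited note does not specify the exact reward formula, and the two answer distributions below are invented.

```python
def expected_agreement(answer_probs):
    """Probability that two independent samples give the same answer."""
    return sum(p * p for p in answer_probs.values())


def accuracy(answer_probs, truth):
    """Probability that a single sample is the true answer."""
    return answer_probs.get(truth, 0.0)


truth = "B"
# Hypothetical answer distributions for one question at two points in training.
early = {"B": 0.60, "A": 0.20, "C": 0.20}  # mostly right, but diffuse
late = {"A": 0.95, "B": 0.05}              # confidently wrong, highly reproducible

for name, dist in (("early", early), ("late", late)):
    print(name, "proxy reward:", round(expected_agreement(dist), 3),
          "accuracy:", accuracy(dist, truth))
```

Under this assumed reward, the late distribution scores higher on the proxy while being far less accurate, so an optimizer that only sees agreement has no reason to move back toward the correct answer.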

The sharpest way to see the boundary is the counter-case. Generative models trained on *many diverse experts with different biases* converge toward a consensus that beats any single expert, but only because the experts' errors are uncorrelated, so cross-entropy optimization denoises them via an implicit majority vote (see "Can models trained on many imperfect experts outperform each one?"). That's voting working as advertised. Self-consistency is the same machinery run on *one* model sampling itself, where the 'voters' all inherit the same prior, so there's no uncorrelated noise to cancel out. Voting denoises independent errors; it cannot denoise a shared bias.
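The contrast can be sketched with a Condorcet-style calculation. The sketch assumes binary right/wrong answers, an odd number of voters, and a toy mixture model in which a fraction `shared_bias` of questions trigger the same wrong answer in every sample; `shared_bias` and both function names are hypothetical, not taken from the cited work.

```python
from math import comb


def majority_accuracy(n, p):
    """P(a strict majority of n independent voters is right), each right with prob p.

    Assumes binary answers and odd n, so ties cannot occur.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def self_vote_accuracy(n, p, shared_bias):
    """Toy mixture: on a `shared_bias` fraction of questions every sample repeats
    the same wrong answer; on the rest, samples err independently."""
    return (1 - shared_bias) * majority_accuracy(n, p)


for n in (1, 11, 101):
    print(n,
          round(majority_accuracy(n, 0.6), 3),         # independent voters
          round(self_vote_accuracy(n, 0.6, 0.3), 3))   # one model, shared prior
```

With independent voters, accuracy climbs toward 1 as `n` grows. In the mixture it can never exceed `1 - shared_bias`, however many samples are drawn, because extra votes only average away the independent part of the error.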

What this suggests for fixes: the escape routes in the corpus all break the self-agreement loop rather than tightening it. Comparing a generated answer against *broader external alternatives* disrupts the over-trust bias (see "Why do models trust their own generated answers?"), and self-examining schemes that derive reward from pairwise ranking *between* candidates, rather than from the reproducibility of a single answer, show gains without external labels (see "Can models learn to judge themselves without external rewards?"). The unifying lesson: agreement is only a useful truth signal when the things agreeing are independent, and pretraining bias is precisely what destroys that independence.


Sources (6 notes)

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward's correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached a 59.90% win rate without external signals, up from 52.37%.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst auditing claims about self-consistency and pretraining bias in LLMs (mid-2024 through Oct 2025). The question: why do agreement-based methods fail precisely where pretraining bias is strongest?

What a curated library found — and when (dated claims, not current truth):
— Self-consistency only works when model errors are uncorrelated; pretraining bias creates correlation, so multiple samples converge on the same wrong answer (2024–2025).
— Cognitive biases are planted during pretraining and survive finetuning; instruction data nudges but does not overwrite them (~2507.07186, 2025-07).
— Models exhibit inherent bias toward trusting their own outputs during self-evaluation, making self-checking unreliable (~2403.09972, 2024-03).
— RL post-training collapses onto a single dominant pretraining distribution format within the first epoch, suppressing diversity (~2504.07912, 2025-04).
— Generative models trained on diverse independent experts do converge toward correct consensus via implicit majority vote, but this requires uncorrelated errors (~2406.11741, 2024-06).

Anchor papers (verify; mind their dates):
- arXiv:2507.07186 (2025-07): Cognitive bias origins in pretraining vs. finetuning.
- arXiv:2504.07912 (2025-04): RL amplification of pretraining behaviors (Echo Chamber).
- arXiv:2403.09972 (2024-03): Self-detection bias in LLM self-evaluation.
- arXiv:2406.11741 (2024-06): Generative models and expert consensus (Transcendence).

Your task:
(1) RE-TEST the correlation claim. Has emergence of larger models, ensemble orchestration (mixture-of-experts, multi-agent routing), or explicit diversity-preserving training (e.g., constrained beam search, latent-space decoupling) since reduced the pretraining-bias correlation effect? Where does the constraint still hold?
(2) Surface the strongest work from the last 6 months that CONTRADICTS or SUPERSEDES the "self-consistency is fundamentally broken" narrative. Does any recent paper show self-consistency working *despite* pretraining bias, or show a mechanism to break the correlation?
(3) Propose 2 research questions that assume the regime has shifted: (a) if diversity-aware training or routing now decouples errors better, what is the new bottleneck? (b) if models can now self-detect their own biases, what makes that detection reliable?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
