Why does low temperature sampling extract consensus from diverse training data?

This explores why turning the temperature dial down doesn't just make outputs repetitive — it pulls the model toward a kind of majority vote baked into data that came from many disagreeing sources.

Low-temperature sampling acts as a consensus extractor, not just a repeatability switch. The cleanest answer in the corpus comes from work on models trained across many imperfect experts: when a generative model learns from a crowd of teachers who each carry their own biases, cross-entropy optimization pushes it toward the *center of mass* of their behavior rather than any one teacher's quirks. Low temperature is what makes that center visible. By following the highest-probability path, you read off the implicit majority vote, and when individual experts' errors are uncorrelated, that vote denoises them and can outperform every single expert it learned from (see: Can models trained on many imperfect experts outperform each one?). So the consensus isn't created by low temperature; it's *surfaced* by it. The diversity of the training data is exactly what makes the averaged signal trustworthy.
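A minimal sketch of that argument at a single decision state, under loud assumptions: the five experts, their 60% accuracy, and the rule that each expert makes its own distinct mistake are all hypothetical, and the "model" is simply the average of the experts' distributions (which is what an unconstrained categorical model fitted by cross-entropy to equally weighted pooled data converges to). The names `apply_temperature`, `pooled_distribution`, and `make_experts` are illustrative, not taken from any of the cited notes.

```python
CORRECT = 0  # index of the correct action in this toy decision state


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution as p_i ** (1 / T), renormalized."""
    if temperature <= 0:
        # Treat T = 0 as greedy decoding: all mass on the most likely option.
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def pooled_distribution(expert_dists):
    """Average the experts' distributions (the cross-entropy minimizer
    for an unconstrained model trained on equally weighted pooled data)."""
    n = len(expert_dists)
    k = len(expert_dists[0])
    return [sum(d[i] for d in expert_dists) / n for i in range(k)]


def make_experts(num_experts=5, accuracy=0.6):
    """Hypothetical experts: each is right with the same probability and
    otherwise makes its own idiosyncratic mistake (uncorrelated errors)."""
    experts = []
    for e in range(num_experts):
        dist = [0.0] * (num_experts + 1)
        dist[CORRECT] = accuracy
        dist[e + 1] = 1.0 - accuracy  # this expert's private wrong action
        experts.append(dist)
    return experts


experts = make_experts()
model = pooled_distribution(experts)

for t in (1.0, 0.5, 0.1, 0.0):
    p_correct = apply_temperature(model, t)[CORRECT]
    print(f"T={t}: P(correct action) = {p_correct:.3f}")
```

At T = 1 the pooled model is exactly as accurate as each expert, because it reproduces their mistakes in proportion. Lowering the temperature shifts mass away from the scattered errors and onto the one action all experts share. If the experts instead shared the same mistake, the mode could be wrong and low temperature would amplify the error.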

The same logic shows up from the opposite direction in test-time self-improvement. Models can bootstrap on unlabeled data by sampling many answers and rewarding whichever answer the crowd agrees on, and this works precisely because consensus answers tend to be correct (see: Can models improve themselves using only majority voting?). That's the temperature story in reverse: high-temperature sampling spreads you across the distribution so you can *find* the consensus by voting, whereas low-temperature sampling collapses you straight onto it. Both lean on the same assumption: that the mode of a distribution learned from diverse sources carries denoised signal.
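The two routes to the mode can be compared in a small sketch. Everything here is an assumption made for illustration: the four-answer distribution `answer_probs` is made up, the sample count is arbitrary, and `pseudo_rewards` is a simplified stand-in for a majority-vote reward signal, not the actual Test-Time RL procedure.

```python
import random
from collections import Counter


def sample_answers(answer_probs, num_samples, rng):
    """Draw answers (as indices) from a categorical distribution at T = 1."""
    options = list(range(len(answer_probs)))
    return rng.choices(options, weights=answer_probs, k=num_samples)


def majority_vote(answers):
    """Return the most frequent answer and the fraction of samples that chose it."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def pseudo_rewards(answers, winner):
    """Label-free reward: 1.0 for samples that match the majority, else 0.0."""
    return [1.0 if a == winner else 0.0 for a in answers]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
answer_probs = [0.45, 0.25, 0.20, 0.10]  # hypothetical; index 0 is the mode

# High-temperature route: spread out over the distribution, then vote.
answers = sample_answers(answer_probs, 1001, rng)
winner, agreement = majority_vote(answers)
rewards = pseudo_rewards(answers, winner)

# Low-temperature route: go straight to the mode.
greedy = max(range(len(answer_probs)), key=lambda i: answer_probs[i])

print("majority-vote answer:", winner)
print("greedy answer:       ", greedy)
print("agreement fraction:  ", round(agreement, 3))
```

With enough samples the vote and the greedy pick land on the same answer. The voting route additionally yields an agreement fraction, which estimates how much probability mass actually sits on the mode.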

But consensus has a cost, and the corpus is sharp about it. Pulling toward the agreement point means suppressing everything else. RL post-training does this aggressively, amplifying one dominant format from pretraining within the first epoch while collapsing the alternatives, and the format that 'wins' depends on model scale, not necessarily on being better (see: Does RL training collapse format diversity in pretrained models?). Whether post-training narrows the output at all turns out to be domain-dependent: preference tuning reduces lexical and syntactic diversity in code generation, where converging on a correct solution is what gets rewarded, but increases it in creative writing, where distinctiveness is rewarded (see: Does preference tuning always reduce diversity the same way?). The consensus low temperature extracts is only as good as the thing the domain actually wants.

There's also a trap worth knowing about. Consistency is not the same as reliability. Zero temperature and a fixed seed will reproduce the *same* output every time, but that output is still a single draw from the distribution, so repeating it 100 times tells you nothing about whether it was a good draw (see: Does setting temperature to zero actually make LLM outputs reliable?). The consensus you read off the mode is therefore meaningful only when the underlying distribution genuinely encodes denoised agreement (many diverse experts, votable correct answers). When it doesn't, low temperature just gives you a confidently repeated guess. And consensus mechanisms can fail outright at the system level: when you make LLM *agents* negotiate agreement explicitly, they tend to stall out rather than converge, with agreement degrading as the group grows (see: Can LLM agent groups reliably reach consensus together?). That is a reminder that the implicit statistical consensus inside one model is a very different, and more robust, thing than consensus assembled across many agents.
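The gap between consistency and reliability can be shown with a toy sketch. The two distributions below are invented for illustration, index 0 is assumed to be the correct answer in both, and `greedy_decode` stands in for zero-temperature decoding with a fixed seed.

```python
from collections import Counter

CORRECT = 0  # assumed correct answer index for both toy prompts


def greedy_decode(probs):
    """Zero-temperature decoding: always return the most likely answer."""
    return max(range(len(probs)), key=lambda i: probs[i])


def repeat_agreement(outputs):
    """Fraction of repeated runs that produced the most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


# Hypothetical answer distributions for two prompts.
prompts = {
    "well supported": [0.70, 0.20, 0.10],    # mode is correct, wide margin
    "poorly supported": [0.34, 0.36, 0.30],  # mode is a narrow, wrong plurality
}

results = {}
for name, dist in prompts.items():
    outputs = [greedy_decode(dist) for _ in range(100)]
    ranked = sorted(dist, reverse=True)
    results[name] = {
        "agreement": repeat_agreement(outputs),
        "accuracy": sum(o == CORRECT for o in outputs) / len(outputs),
        "margin": ranked[0] - ranked[1],  # how far the mode leads the runner-up
    }
    print(name, results[name])
```

Both prompts show perfect agreement across 100 repetitions, so repetition alone cannot separate them. What differs is the distribution behind the output: in the second case the mode leads by a sliver and happens to be wrong.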

The thing you didn't know you wanted to know: low temperature isn't a reliability knob, it's a *readout* knob. It exposes whatever consensus the training distribution already contains — denoised wisdom when the data is diverse and the errors cancel, or a brittle single guess when they don't. The interesting question is never 'should I lower temperature' but 'does my distribution actually have a consensus worth extracting.'

Sources (6 notes)

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.
