SYNTHESIS NOTE

Why does majority voting outperform more complex inference methods?

Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

For reasoning models, majority voting across independent samples is a surprisingly strong baseline that sophisticated inference-time methods struggle to beat. Think Deep, Think Fast finds it generally competitive with or outperforming Best-of-N (which requires an external reward model) and sequential revision methods (which require the model to self-evaluate).

The robustness comes from what majority voting doesn't do: it doesn't require a verifier (which can be wrong), it doesn't require self-assessment (which reasoning models are poor at), and it doesn't rely on trace length (which is negatively correlated with correctness). It just exploits statistical redundancy across independent samples.
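The aggregation step described above is small enough to sketch in full. This is a minimal illustration, not code from the cited paper: the function name `majority_vote`, the convention that `None` marks a sample with no extractable answer, and the first-seen tie-break are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most frequent non-None answer, or None if there is none.

    Ties go to the answer that appeared first, an arbitrary choice
    made for this sketch.
    """
    # Drop samples where no final answer could be extracted (assumed to be None).
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical final answers extracted from six independent samples.
samples = ["42", "42", "17", None, "42", "17"]
print(majority_vote(samples))
```

Note what the function never sees: no reward-model score, no self-assessment, and no trace length. Only the extracted final answers enter the vote.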

This doesn't mean majority voting is optimal. It is a ceiling-limited strategy: it can only return the correct answer when that answer is the most common one among the samples, so adding samples stops helping once the vote has stabilized. But it's the right default: simple, interpretable, and hard to beat without investing significantly in verifier quality. The research implication is that gains from more complex methods should be benchmarked against majority voting, not against single-sample baselines. Many reported improvements in the literature may not survive this comparison.
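The ceiling can be made concrete with a toy model. The sketch below assumes samples are independent and that every wrong sample gives the same wrong answer (a pessimistic, binary simplification that is not taken from the source); `majority_accuracy` is a hypothetical helper name.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary model: each sample is correct with probability p,
    and all incorrect samples agree with each other. n must be odd so
    that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so there are no ties")
    need = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Compare a model that is usually wrong (p=0.4) with one that is usually right (p=0.6).
for p in (0.4, 0.6):
    print(p, [round(majority_accuracy(p, n), 3) for n in (1, 5, 25)])
```

Under these assumptions, more samples help only when the single-sample accuracy is above one half; below it, voting converges on the wrong answer. Real tasks with many distinct wrong answers are more forgiving, since the correct answer only needs to be the plurality.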

Extreme decomposition + voting at million-step scale (MAKER): The MAKER framework pushes majority voting to its logical extreme by decomposing complex tasks into atomic subtasks executed by microagents, with each subtask's output validated by voting. At scale (1000+ steps), this achieves error-free execution that no single-agent approach matches. MAKER also reveals scaling laws for multi-agent systems: more agents improve performance on complex tasks but hurt it on simple ones, where communication overhead exceeds the benefit, and there is a critical complexity threshold below which single agents dominate. This extends the majority-voting baseline finding: voting's robustness is not just a property of independent sampling at the problem level. It works at every level of decomposition, from whole-problem voting down to atomic-subtask voting. The practical implication: a long chain succeeds only if every step does, so end-to-end reliability is roughly the product of the per-step reliabilities. When individual subtask accuracy is already high (>95%), voting over decomposed subtasks pushes each step's error rate low enough for that product to stay high. See "Can extreme task decomposition enable reliable execution at million-step scale?".
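The arithmetic behind the compounding claim can be sketched directly. This is a back-of-the-envelope model rather than MAKER's actual procedure: it assumes steps fail independently and that any single failed step fails the whole task, and the function names are hypothetical.

```python
def task_success(step_reliability: float, n_steps: int) -> float:
    """End-to-end success probability when every step must succeed.

    Assumes independent steps, so the per-step probabilities multiply.
    """
    return step_reliability ** n_steps


def required_step_reliability(target: float, n_steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end success rate."""
    return target ** (1.0 / n_steps)


# A 95%-reliable step looks good in isolation but collapses over long chains.
for n in (10, 100, 1000):
    print(n, task_success(0.95, n))

# Per-step reliability needed for a 90% chance of finishing 1000 steps cleanly.
print(required_step_reliability(0.90, 1000))
```

The second function shows why per-step voting matters in this regime: the reliability each step needs rises quickly toward 1 as the chain gets longer, which a single unvoted sample at 95% cannot supply.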

Original note title

majority voting is more robust than best-of-n and sequential revisions for reasoning models