SYNTHESIS NOTE

Can generative and discriminative models reach agreement?

Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?

Synthesis note · 2026-02-22 · sourced from Question Answer Search

Language models offer two fundamentally different ways to answer questions. Generatively: decode the answer to which the model assigns the highest probability. Discriminatively: score each candidate answer and pick the best. These two procedures often disagree. Generative decoding fails when probability mass spreads across contradicting answers; discriminative decoding fails due to miscalibration or sensitivity to question wording. Both are noisy, but their errors come from different sources, so their noise is not correlated.
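The disagreement between the two procedures can be made concrete with a small sketch. The candidate strings and all scores below are hypothetical numbers chosen for illustration, not outputs of any real model; `generative_choice` and `discriminative_choice` are illustrative helper names.

```python
import math


def softmax(logits):
    """Convert log-scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generative_choice(candidates, gen_logprobs):
    """Pick the candidate the model is most likely to generate."""
    probs = softmax(gen_logprobs)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best]


def discriminative_choice(candidates, p_correct):
    """Pick the candidate the model scores as most likely to be correct."""
    best = max(range(len(candidates)), key=lambda i: p_correct[i])
    return candidates[best]


# Hypothetical scores for the question "What is the capital of France?"
candidates = ["Paris", "Lyon", "Marseille"]
gen_logprobs = [-1.2, -0.9, -2.5]  # assumed log P(answer | question)
p_correct = [0.80, 0.35, 0.10]     # assumed P(correct | question, answer)

gen_answer = generative_choice(candidates, gen_logprobs)
disc_answer = discriminative_choice(candidates, p_correct)
print(gen_answer, disc_answer)
```

With these made-up numbers the generative procedure selects "Lyon" while the discriminative procedure selects "Paris", which is the kind of conflict the rest of this note is about.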

The Consensus Game formalizes this reconciliation problem as a regularized imperfect-information sequential signaling game. A Generator agent must communicate an abstract correct/incorrect value to a Discriminator agent, but can only do so using natural language strings from a candidate set. An effective joint policy is one where both agents agree on which strings map to "correct." The resulting decoding algorithm, Equilibrium-Ranking, finds approximate equilibria of this game and ranks the candidates by the equilibrium policies.
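One way to picture the equilibrium search described above is as an iterative update in which each agent is rewarded for agreeing with the other's average behaviour while being pulled back toward its initial, model-derived policy. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: the update rule is a generic regularized no-regret style update, and the names `equilibrium_ranking`, `eta`, `lam`, and `steps`, along with every number, are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def equilibrium_ranking(gen_init, disc_init, steps=200, eta=0.1, lam=0.1):
    """Sketch of a regularized agreement-seeking update (assumed form).

    gen_init[v][i]:  initial generator probability of candidate i given
                     label v (0 = incorrect, 1 = correct).
    disc_init[i][v]: initial discriminator probability of label v for
                     candidate i. All entries must be strictly positive.
    """
    n = len(disc_init)
    gen = [row[:] for row in gen_init]
    disc = [row[:] for row in disc_init]
    gen_sum = [[0.0] * n for _ in range(2)]    # running sums of past policies
    disc_sum = [[0.0, 0.0] for _ in range(n)]

    for t in range(1, steps + 1):
        for v in range(2):
            for i in range(n):
                gen_sum[v][i] += gen[v][i]
                disc_sum[i][v] += disc[i][v]
        # Temperature shrinks over time; lam keeps policies near the initial ones.
        temp = 1.0 / (eta * t) + lam
        # Generator: prefer strings the discriminator has, on average, labelled v.
        gen = [
            softmax([(disc_sum[i][v] / (2 * t)
                      + lam * math.log(gen_init[v][i])) / temp
                     for i in range(n)])
            for v in range(2)
        ]
        # Discriminator: prefer labels under which the generator, on average,
        # produced this string. gen_sum was accumulated before gen was updated.
        disc = [
            softmax([(gen_sum[v][i] / (2 * t)
                      + lam * math.log(disc_init[i][v])) / temp
                     for v in range(2)])
            for i in range(n)
        ]
    return gen, disc


# Hypothetical initial policies for three candidate strings.
candidates = ["Paris", "Lyon", "Marseille"]
gen_init = [[0.25, 0.30, 0.45],   # P(candidate | "incorrect")
            [0.35, 0.45, 0.20]]   # P(candidate | "correct")
disc_init = [[0.20, 0.80], [0.65, 0.35], [0.90, 0.10]]

gen_eq, disc_eq = equilibrium_ranking(gen_init, disc_init)
ranking = sorted(range(len(candidates)),
                 key=lambda i: disc_eq[i][1], reverse=True)
consensus_answer = candidates[ranking[0]]
print(consensus_answer)
```

Here the final ranking uses the discriminator's equilibrium probability of "correct"; ranking by the generator's equilibrium policy `gen_eq[1]` is the other natural choice. The regularization weight `lam` controls how far the consensus is allowed to drift from what the model initially believed.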

The results are striking: LLaMA-7B with Equilibrium-Ranking matches or outperforms LLaMA-65B and PaLM-540B on multiple benchmarks spanning reading comprehension, commonsense reasoning, mathematical problem-solving, and dialogue. A 7B model matching a 540B model corresponds to a roughly 77x difference in parameter count (540 / 7 ≈ 77).

The insight is that generative and discriminative procedures contain complementary information. Neither alone captures the model's "best guess at the truth." The game-theoretic framework extracts a consensus signal that is more reliable than either procedure individually — analogous to how ensemble methods combine weak learners, but operating within a single model's two modes of operation.

This is a training-free method: no fine-tuning is required. The computational overhead comes from finding the equilibrium at inference time, making it a form of test-time compute scaling. In relation to the linked question "Can inference compute replace scaling up model size?", Equilibrium-Ranking provides a concrete mechanism: the test-time compute goes into reconciling the model's own internal disagreements rather than generating longer reasoning chains.

The connection to multi-agent debate is suggestive. Where multi-agent LLM systems tend to converge without genuine deliberation (see "Why do multi-agent LLM systems converge without genuine deliberation?"), the Consensus Game forces genuine deliberation between two perspectives, generative and discriminative, within a single model. The equilibrium constraint prevents premature convergence because both agents must independently arrive at consistent signals.

The Consensus Game also sidesteps the evidence-verification problem that plagues inter-model debate (see "When does debate actually improve reasoning accuracy?"). Both "agents" operate within the same model's knowledge, so there is no risk of one agent persuading the other with rhetorically superior but factually wrong arguments. The equilibrium constraint forces agreement on what the model actually knows rather than what it can argue most convincingly.

Original note title

game-theoretic equilibrium between generative and discriminative LM decoding reconciles their inconsistent predictions — small models with consensus match models 100x larger