SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The standard assumption in RL for LLMs is that ground-truth labels or a trained reward model are required. TTRL (Test-Time Reinforcement Learning) challenges this: by using majority voting across repeated samples as the reward signal, the model can train on unlabeled data at test time.

The logic is elegant: if you sample answers to the same question many times and one answer emerges as the majority, that answer is likely to be correct. The majority answer can then serve as a pseudo-label for generating reward signals. The reward isn't perfect, but it's surprisingly effective: consistent enough to drive genuine policy improvement.
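
As a minimal sketch of the reward estimation described above, the following Python function takes the final answers extracted from repeated samples, picks the most common one as the pseudo-label, and scores each sample by whether it agrees. The function name and the binary 1.0/0.0 reward are illustrative assumptions, not details taken from the source.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    Assumption: a sample earns reward 1.0 if it matches the majority
    answer and 0.0 otherwise. Ties go to the answer seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # The most frequent answer stands in for the missing ground-truth label.
    pseudo_label, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five hypothetical samples for one unlabeled question.
label, rewards = majority_vote_rewards(["42", "41", "42", "42", "17"])
print(label, rewards)  # 42 [1.0, 0.0, 1.0, 1.0, 0.0]
```

Note that the reward measures agreement with the vote, not correctness: if the majority is wrong, the wrong answer is the one that gets rewarded.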

This opens a path toward model self-evolution that doesn't depend on human annotation or pre-trained reward models. The model uses its own inference-time behavior (its tendency to agree with itself) as a training signal. This is a form of bootstrapping: test-time compute enables reward estimation, which enables training, which improves the model.
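
The loop of sampling, voting, rewarding, and updating can be illustrated with a deliberately simplified toy. Here a table of answer weights stands in for the policy, and a small additive nudge stands in for the RL update; both are assumptions made for illustration, as are the names `ttrl_toy_step`, `n_samples`, and `lr`. None of this is the optimizer or setup used by TTRL.

```python
import random
from collections import Counter


def ttrl_toy_step(weights, rng, n_samples=16, lr=0.5):
    """One toy round of sample -> vote -> reward -> update.

    `weights` maps each candidate answer to an unnormalized preference.
    Assumption: the update simply adds lr / n_samples to the weight of
    every sample that agrees with the majority (reward 1), and leaves
    disagreeing samples (reward 0) untouched.
    """
    answers = list(weights)
    # Test-time compute: draw repeated samples from the current "policy".
    samples = rng.choices(answers, weights=[weights[a] for a in answers], k=n_samples)
    # Reward estimation: the majority answer becomes the pseudo-label.
    pseudo_label = Counter(samples).most_common(1)[0][0]
    # Training: reinforce the samples that agreed with the pseudo-label.
    new_weights = dict(weights)
    for s in samples:
        if s == pseudo_label:
            new_weights[s] += lr / n_samples
    return new_weights, pseudo_label


rng = random.Random(0)
policy = {"42": 3.0, "41": 1.0, "17": 1.0}  # hypothetical starting preferences
for _ in range(5):
    policy, voted = ttrl_toy_step(policy, rng)
print(policy)
```

Each round shifts probability mass toward whichever answer won the vote, which is why the quality of the initial majority matters: the loop amplifies the model's existing consensus, whether or not that consensus is correct.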

The economic implication: as real-world tasks grow more complex, large-scale annotation for RL becomes impractical, so estimating rewards from unlabeled data, as TTRL does, becomes increasingly important as a scaling strategy.

Original note title

test-time rl on unlabeled data is possible using majority-vote reward estimation