SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Self-Questioning Language Models (SQLM) adapts asymmetric self-play from robotic manipulation (OpenAI, 2021) to language domains. It trains two RL agents, a proposer and a solver. Given only a topic specification (e.g., "algebra word problems"), the proposer generates questions and the solver attempts to answer them.

The reward structure creates natural difficulty calibration: the proposer is rewarded only when problems are neither too easy nor too hard, and is penalized both for trivially solvable questions and for impossible ones. The solver is rewarded by majority voting: multiple solutions are sampled per question, and agreement with the consensus answer serves as a proxy for correctness without ground-truth labels. For coding tasks, the proposer can generate unit tests, providing direct verifiability.
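
The reward scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the SQLM implementation: the function names, the binary 0/1 rewards, and the `min_agree` threshold are assumptions made for the example. Here a question counts as "too easy" when every sampled answer agrees and "too hard" when fewer than `min_agree` samples share the most common answer.

```python
from collections import Counter

def majority_answer(answers):
    """Return (most common answer, its count); ties go to the answer seen first."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count

def solver_rewards(answers):
    """Give 1.0 to each sampled answer that matches the majority, else 0.0."""
    majority, _ = majority_answer(answers)
    return [1.0 if a == majority else 0.0 for a in answers]

def proposer_reward(answers, min_agree=2):
    """Give 1.0 only when the question is neither too easy nor too hard.

    Assumed thresholds (hypothetical): too easy means all samples agree,
    too hard means fewer than `min_agree` samples share the top answer.
    """
    _, count = majority_answer(answers)
    return 1.0 if min_agree <= count < len(answers) else 0.0

# Four sampled solver answers to one proposed question.
samples = ["4", "4", "4", "5"]
print(solver_rewards(samples))   # per-sample solver rewards
print(proposer_reward(samples))  # proposer reward for this question
```

No ground-truth answer appears anywhere in the computation, which is the point of the design: both rewards are derived from agreement among the solver's own samples.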

This creates an automatically calibrated curriculum. The proposer explores the space of possible problems at the frontier of the solver's capability — hard enough to be informative, not so hard as to produce only noise. As the solver improves, the proposer must generate harder problems to maintain its own reward, creating escalating difficulty without human intervention.
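
A toy model can make the escalation argument concrete. The sketch below is an illustration under strong assumptions, none of which come from SQLM itself: solver skill and problem difficulty are single numbers, the chance of a correct sample is a logistic function of their difference, and the proposer is rewarded when the samples are mixed (some right, some wrong). Under those assumptions, the difficulty that maximizes the proposer's expected reward sits where the solver succeeds about half the time, so it moves upward as skill increases.

```python
import math

def solve_prob(skill, difficulty):
    """Assumed logistic model: probability that one sample is correct."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))

def expected_proposer_reward(skill, difficulty, n_samples=8):
    """Probability that n samples are neither all correct nor all wrong."""
    p = solve_prob(skill, difficulty)
    return 1.0 - p ** n_samples - (1.0 - p) ** n_samples

def best_difficulty(skill, grid):
    """Difficulty on the grid with the highest expected proposer reward."""
    return max(grid, key=lambda d: expected_proposer_reward(skill, d))

# Candidate difficulties 0.0, 0.1, ..., 10.0 (arbitrary units).
grid = [i / 10 for i in range(0, 101)]
for skill in (1.0, 3.0, 5.0):
    print(skill, best_difficulty(skill, grid))
```

In this simplified setting the proposer's optimum tracks the solver's skill exactly; real training replaces the closed-form probability with sampled solver behaviour and the grid search with policy-gradient updates to the proposer.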

The mechanism addresses two fundamental limitations of self-improvement: (a) the need for external training data (the proposer generates all training problems) and (b) the need for external verification (majority voting provides approximate correctness). Both solutions are intrinsic — no human labels, no external reward models, no ground-truth answers.

The key risk is inherited from "Does self-consistency reliably reward correct answers during training?": the solver's majority-voting reward is the same proxy signal and is vulnerable to the same reward hacking, because an answer that is consistent but wrong still earns the reward. But the proposer provides a natural counterforce: it actively searches for the solver's weaknesses, potentially surfacing problems where majority voting is miscalibrated.

The connection to intrinsic motivation research is direct — curiosity-driven exploration (prediction error, state entropy, Go-Explore) provides the theoretical foundation for why generating novel challenges produces better learning than rehearsing known solutions.

Original note title

asymmetric self-play enables self-improvement without external data by training a proposer to generate challenging questions for a solver