SYNTHESIS NOTE

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Self-Questioning Language Models (SQLM) adapts asymmetric self-play from robotic manipulation (OpenAI, 2021) to language domains. It uses two RL agents, a proposer and a solver. Given only a topic specification (e.g., "algebra word problems"), the proposer generates questions and the solver attempts to answer them.

The reward structure calibrates difficulty on its own: the proposer is rewarded only when problems are neither too easy nor too hard, and is penalized both for trivially solvable questions and for impossible ones. The solver is rewarded by majority voting: multiple solutions are sampled and checked for consensus, which serves as a proxy for correctness without ground-truth labels. For coding tasks, the proposer can also generate unit tests, which provide direct verifiability.
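The two reward signals described above can be sketched in a few lines. This is a minimal illustration, not the SQLM implementation: the function names, the 0/1 reward values, and the proxies for "too easy" (all samples agree) and "too hard" (no two samples agree) are assumptions made for the example.

```python
from collections import Counter


def majority_answer(answers):
    """Return (most common answer, number of samples that gave it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0]


def solver_rewards(answers):
    """Assumed scheme: each sample earns 1.0 if it matches the majority answer."""
    majority, _ = majority_answer(answers)
    return [1.0 if a == majority else 0.0 for a in answers]


def proposer_reward(answers):
    """Assumed scheme: reward the proposer only for mixed agreement.

    Unanimous samples are treated as a proxy for a trivial question;
    samples where no answer repeats are treated as a proxy for an
    impossible (or ill-posed) one.
    """
    _, count = majority_answer(answers)
    too_easy = count == len(answers)
    too_hard = count == 1 and len(answers) > 1
    return 0.0 if (too_easy or too_hard) else 1.0


# Hypothetical final answers sampled from the solver for one question.
samples = ["4", "4", "5", "4"]
print(solver_rewards(samples))   # [1.0, 1.0, 0.0, 1.0]
print(proposer_reward(samples))  # 1.0
```

Note that nothing here consults a ground-truth answer: both rewards are computed purely from agreement among the solver's own samples.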

This creates an automatically calibrated curriculum. The proposer explores the space of possible problems at the frontier of the solver's capability — hard enough to be informative, not so hard as to produce only noise. As the solver improves, the proposer must generate harder problems to maintain its own reward, creating escalating difficulty without human intervention.
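The escalation dynamic can be illustrated with a deliberately simplified, deterministic toy model. Everything in it is an assumption for illustration: skill and difficulty are single numbers, the solve probability is a logistic function of their difference, the proposer follows a hand-written rule in place of a learned RL policy, and the solver learns fastest when the solve probability is near 0.5.

```python
import math


def solve_probability(skill, difficulty):
    """Toy model: chance the solver answers correctly (logistic in skill - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))


def simulate_curriculum(rounds=200, skill=0.0, difficulty=0.0,
                        step=0.5, learning_rate=0.1, band=(0.2, 0.8)):
    """Return a list of (difficulty, skill, proposer_reward), one entry per round."""
    low, high = band
    history = []
    for _ in range(rounds):
        p = solve_probability(skill, difficulty)
        # The proposer is rewarded only inside the "informative" band.
        proposer_reward = 1.0 if low < p < high else 0.0
        history.append((difficulty, skill, proposer_reward))
        # Hand-written stand-in for the proposer's policy update:
        # move difficulty back toward the rewarded band.
        if p >= high:
            difficulty += step
        elif p <= low:
            difficulty -= step
        # Toy learning rule: progress is largest when p is near 0.5
        # and vanishes when problems are trivial or impossible.
        skill += learning_rate * 4.0 * p * (1.0 - p)
    return history


history = simulate_curriculum()
print("start:", history[0])
print("end:  ", history[-1])
```

In this toy run the proposer has to keep raising difficulty as the solver's skill grows, because problems that stay fixed drift out of the rewarded band. The real system would learn this behavior from reward rather than follow a fixed rule.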

The mechanism addresses two fundamental limitations of self-improvement: (a) the need for external training data (the proposer generates all training problems) and (b) the need for external verification (majority voting provides approximate correctness). Both solutions are intrinsic — no human labels, no external reward models, no ground-truth answers.

The key risk is inherited from the related note "Does self-consistency reliably reward correct answers during training?": the solver's majority-voting reward is the same proxy signal and is vulnerable to the same reward hacking. But the proposer provides a natural counterforce: it actively searches for the solver's weaknesses, potentially surfacing problems where majority voting is miscalibrated.
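A toy comparison makes the proxy gap concrete. The helper names and the sample data below are hypothetical, and the 0/1 consensus reward is an assumed scheme: the point is only that agreement among samples can be high while accuracy against a hidden ground truth is zero.

```python
from collections import Counter


def consensus_rewards(answers):
    """Assumed proxy reward: 1.0 for each sample that matches the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def accuracy(answers, truth):
    """Fraction of samples matching a ground truth the training loop never sees."""
    return sum(a == truth for a in answers) / len(answers)


# Hypothetical case: the solver is consistently wrong about a question
# whose true answer is "8".
consistent_but_wrong = ["7", "7", "7", "7", "9"]
rewards = consensus_rewards(consistent_but_wrong)

print(sum(rewards) / len(rewards))           # 0.8 mean proxy reward
print(accuracy(consistent_but_wrong, "8"))   # 0.0 true accuracy
```

Because the majority is not unanimous, a proposer rewarded for mixed agreement would also be rewarded for this question, so the counterforce described above is not guaranteed to expose this kind of miscalibration.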

The connection to intrinsic motivation research is direct — curiosity-driven exploration (prediction error, state entropy, Go-Explore) provides the theoretical foundation for why generating novel challenges produces better learning than rehearsing known solutions.

Related concepts in this collection 5

Does self-consistency reliably reward correct answers during training? Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
SQLM inherits the proxy reward risk; the proposer partially mitigates by adversarially targeting weaknesses
What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
SQLM creates problems in the gap region by design (neither trivial nor impossible)
Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
SQLM creates a natural curriculum but from self-play rather than from budget scheduling
Can tree search replace human feedback in LLM training? Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.
parallel unsupervised self-improvement mechanism: MCTS derives quality signals from tree-search outcomes, while asymmetric self-play derives training data from proposer-solver dynamics. Both solve the annotation bottleneck but through different structures: MCTS explores within a fixed problem space, whereas self-play generates new problems at the solver's frontier
Can language models learn skills without human supervision? Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?
extends: same proposer-vs-solver self-play, now with a third neutral Judge and natural-language skills instead of weight updates
