Can language models improve themselves without any external training data?
Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
Self-Questioning Language Models (SQLM) adapts asymmetric self-play from robotic manipulation (OpenAI, 2021) to language domains. It trains two RL agents, a proposer and a solver: given only a topic specification (e.g., "algebra word problems"), the proposer generates questions and the solver attempts to answer them.
The reward structure creates natural difficulty calibration: the proposer is rewarded when problems are neither too easy nor too hard — punished for trivially solvable questions and for impossible ones. The solver is rewarded based on majority voting (sampling multiple solutions and checking consensus), serving as a proxy for correctness without ground-truth labels. For coding tasks, the proposer can generate unit tests, providing direct verifiability.
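The reward structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the binary 0/1 reward values, and the exact "too easy" and "too hard" cut-offs (all samples agree, or no two samples agree) are assumptions chosen for clarity.

```python
from collections import Counter
from typing import Sequence


def majority_answer(answers: Sequence[str]) -> tuple[str, int]:
    """Return the most common sampled answer and how many samples gave it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count


def solver_rewards(answers: Sequence[str]) -> list[float]:
    """Assumed rule: 1.0 for each sample that matches the majority answer, else 0.0."""
    majority, _ = majority_answer(answers)
    return [1.0 if a == majority else 0.0 for a in answers]


def proposer_reward(answers: Sequence[str]) -> float:
    """Assumed rule: reward the proposer only for questions of intermediate difficulty."""
    _, count = majority_answer(answers)
    n = len(answers)
    # Every sample agrees: treated as trivially solvable.
    # No two samples agree: treated as impossible (no consensus at all).
    if count == n or count == 1:
        return 0.0
    return 1.0


if __name__ == "__main__":
    samples = ["4", "4", "5"]  # hypothetical solver outputs for one question
    print(solver_rewards(samples), proposer_reward(samples))
```

Note that neither function looks at a ground-truth answer; agreement among the solver's own samples is the only signal.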
This creates an automatically calibrated curriculum. The proposer explores the space of possible problems at the frontier of the solver's capability — hard enough to be informative, not so hard as to produce only noise. As the solver improves, the proposer must generate harder problems to maintain its own reward, creating escalating difficulty without human intervention.
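A toy model can make the escalation concrete. Everything in the sketch below is an assumption made for illustration: solver skill and problem difficulty are single numbers, the chance of solving follows a logistic curve, the proposer picks whichever candidate difficulty maximizes its expected reward, and the solver's learning rule is invented. None of this comes from the SQLM paper.

```python
import math


def solve_prob(skill: float, difficulty: float) -> float:
    """Assumed logistic model: probability that one sample solves the problem."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))


def expected_proposer_reward(skill: float, difficulty: float, n_samples: int = 8) -> float:
    """Probability that n samples are neither all correct nor all wrong."""
    p = solve_prob(skill, difficulty)
    return 1.0 - p ** n_samples - (1.0 - p) ** n_samples


def best_difficulty(skill: float, candidates: list[float], n_samples: int = 8) -> float:
    """The proposer picks the candidate difficulty with the highest expected reward."""
    return max(candidates, key=lambda d: expected_proposer_reward(skill, d, n_samples))


def simulate(rounds: int = 5, skill: float = 0.0, learning_rate: float = 1.0) -> list[float]:
    """Return the difficulty the proposer chooses in each round."""
    candidates = [i * 0.5 for i in range(-10, 41)]  # difficulties from -5.0 to 20.0
    history = []
    for _ in range(rounds):
        difficulty = best_difficulty(skill, candidates)
        history.append(difficulty)
        # Invented learning rule: the solver learns most when the outcome is uncertain.
        p = solve_prob(skill, difficulty)
        skill += learning_rate * 4.0 * p * (1.0 - p)
    return history


if __name__ == "__main__":
    print(simulate())
```

In this toy setting the proposer's expected reward peaks where the solver succeeds about half the time, so the chosen difficulty tracks the solver's skill and rises as the solver improves.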
The mechanism addresses two fundamental limitations of self-improvement: (a) the need for external training data (the proposer generates all training problems) and (b) the need for external verification (majority voting provides approximate correctness). Both solutions are intrinsic — no human labels, no external reward models, no ground-truth answers.
The key risk is inherited from the question "Does self-consistency reliably reward correct answers during training?": the solver's majority-voting reward is the same proxy signal and is vulnerable to the same reward hacking. The proposer provides a natural counterforce, though: it actively searches for the solver's weaknesses, potentially surfacing problems where majority voting is miscalibrated.
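The proxy-reward risk can be shown with a small, hypothetical example. The answer lists and the ground-truth value below are invented, and `accuracy` uses a label that would not be available during label-free training; it appears here only to expose the gap between consistency and correctness.

```python
from collections import Counter
from typing import Sequence


def consistency_reward(answers: Sequence[str]) -> float:
    """Fraction of samples that agree with the majority answer (the proxy signal)."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def accuracy(answers: Sequence[str], truth: str) -> float:
    """Fraction of samples equal to the true answer (unavailable without labels)."""
    return sum(a == truth for a in answers) / len(answers)


truth = "12"                          # hypothetical ground truth
honest = ["12", "12", "15", "12"]     # mostly right, with some disagreement
collapsed = ["0", "0", "0", "0"]      # always the same wrong answer

if __name__ == "__main__":
    for name, answers in [("honest", honest), ("collapsed", collapsed)]:
        print(name, consistency_reward(answers), accuracy(answers, truth))
```

The collapsed solver earns the higher consistency reward while being wrong every time, which is the failure mode a consistency-only reward cannot detect on its own.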
The connection to intrinsic motivation research is direct — curiosity-driven exploration (prediction error, state entropy, Go-Explore) provides the theoretical foundation for why generating novel challenges produces better learning than rehearsing known solutions.
Inquiring lines that use this note as a source (34)
This note is a source for these synthesized inquiries.
- What happens when models train on AI-generated content recursively?
- Why does self-critiquing actually reduce plan quality in language models?
- Why does self-generated training data outperform externally sourced data?
- Does self-revision actually improve reasoning in large language models?
- Can single models correct their own beliefs without amplifying confidence in wrong answers?
- What failure modes emerge when model-generated content trains on itself iteratively?
- Why do error avalanches accelerate in self-training loops without verification?
- Can bilevel autoresearch succeed when the inner and outer loops use different models?
- Can synthetic self-play data teach models when to disagree?
- Why does self-generated training data outperform externally curated domain examples?
- Can self-consistency checks fully prevent error avalanching in self-training loops?
- Can the serving loop itself become the primary training data source?
- How does self-distillation differ from standard fine-tuning approaches?
- Can models learn to generate their own training examples effectively?
- Why does self-correction during generation produce reliable labels without exemplars?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- Do external perspectives fix the self-evaluation bias in language models?
- Can language models accurately evaluate the quality of their own ideas?
- Why does optimizing only quality cause model collapse in self-improvement loops?
- Why do weaker models generate better training data than stronger models?
- Why does filtering for correct examples prevent error compounding in self-training?
- How does error avalanching compound failures in self-training iterations?
- Why do weaker teacher models sometimes produce better training signals than stronger ones?
- Can a model evaluate its own improvements without degrading over iterations?
- Can bilevel autoresearch autonomously modify its own learning algorithms?
- How does the generation-verification gap prevent language models from improving themselves?
- Why does uncontrolled self-revision drift toward instance-specific overfitting?
- Can deterministic computation actually create new information in data?
- Can models adapt and combine search strategies beyond their training algorithm?
- Do models spontaneously develop self-reflection from minimal training signals?
- Can AI systems improve themselves without external feedback?
- What makes policy self-distillation more effective than external teacher distillation?
- Can models learn to optimize their own chain-of-thought generation?
- Do models naturally learn to ask clarifying questions without explicit supervision?
Related concepts in this collection (5)
- Does self-consistency reliably reward correct answers during training?
  Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
  SQLM inherits the proxy reward risk; the proposer partially mitigates by adversarially targeting weaknesses
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  SQLM creates problems in the gap region by design (neither trivial nor impossible)
- Does gradually tightening token budgets beat fixed budget training?
  Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
  SQLM creates a natural curriculum but from self-play rather than from budget scheduling
- Can tree search replace human feedback in LLM training?
  Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.
  parallel unsupervised self-improvement mechanism: MCTS derives quality signals from tree-search outcomes while asymmetric self-play derives training data from proposer-solver dynamics; both solve the annotation bottleneck but through different structures: MCTS explores within a fixed problem space, self-play generates new problems at the solver's frontier
- Can language models learn skills without human supervision?
  Can a three-role self-play system (Challenger, Reasoner, Judge) bootstrap natural-language skills from raw context alone, without human labels or external reward signals?
  extends: same proposer-vs-solver self-play, now with a third neutral Judge and natural-language skills instead of weight updates
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Self-Questioning Language Models
- SPICE: Self-Play In Corpus Environments Improves Reasoning
- Chain-of-thought Reasoning Is A Policy Improvement Operator
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- Self-Rewarding Language Models
Original note title
asymmetric self-play enables self-improvement without external data by training a proposer to generate challenging questions for a solver