Can language models learn skills without human supervision?
Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?
Ctx2Skill constructs skills without human annotation or an external reward signal by running three roles in self-play. A Challenger generates probing tasks and rubrics against a context; a Reasoner attempts them guided by its current skill set; a neutral Judge issues binary pass/fail feedback. The signal is internal: easily solved tasks are routed back to strengthen the Challenger, while failed cases are routed to Proposer and Generator agents that synthesize targeted skill updates for the Reasoner. Both sides evolve through accumulated natural-language skills rather than parameter updates.
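The routing described above can be sketched as a single round of self-play. This is a toy illustration under stated assumptions, not the paper's implementation: the names `Task`, `Role`, `self_play_round`, and the `toy_*` helpers are hypothetical, the roles are plain Python callables standing in for language-model calls, and the Proposer and Generator agents are collapsed into one `synthesize_skill` callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Task:
    prompt: str
    rubric: str  # what the Judge checks the answer against


@dataclass
class Role:
    # Skills are natural-language strings; no model weights are updated.
    skills: List[str] = field(default_factory=list)


def self_play_round(
    context: str,
    challenger: Role,
    reasoner: Role,
    make_tasks: Callable[[str, List[str]], List[Task]],
    attempt: Callable[[Task, List[str]], str],
    judge: Callable[[Task, str], bool],
    synthesize_skill: Callable[[Task, str], str],
) -> Tuple[List[Task], List[Task]]:
    """Run one round and return (passed_tasks, failed_tasks)."""
    passed, failed = [], []
    new_reasoner_skills, new_challenger_skills = [], []
    for task in make_tasks(context, challenger.skills):
        answer = attempt(task, reasoner.skills)
        if judge(task, answer):
            # Solved task: feedback goes to the Challenger (assumed wording).
            passed.append(task)
            new_challenger_skills.append(f"Probe deeper than: {task.prompt}")
        else:
            # Failed task: feedback becomes a targeted Reasoner skill.
            failed.append(task)
            new_reasoner_skills.append(synthesize_skill(task, answer))
    # Apply updates after the round so every task saw the same skill set.
    challenger.skills.extend(new_challenger_skills)
    reasoner.skills.extend(new_reasoner_skills)
    return passed, failed


# --- Toy stand-ins for the model calls (all hypothetical) ---
def toy_make_tasks(context: str, challenger_skills: List[str]) -> List[Task]:
    return [
        Task("How long is the refund window?", "30 days"),
        Task("How long does shipping take?", "one week"),
    ]


def toy_attempt(task: Task, skills: List[str]) -> str:
    # The toy Reasoner can only answer if a skill is keyed to the prompt.
    for skill in skills:
        if skill.startswith(task.prompt):
            return skill.split(" -> ", 1)[1]
    return "I do not know."


def toy_judge(task: Task, answer: str) -> bool:
    return task.rubric in answer  # binary pass/fail


def toy_synthesize_skill(task: Task, answer: str) -> str:
    return f"{task.prompt} -> {task.rubric}"


def run_demo():
    context = "Refunds are allowed within 30 days. Shipping takes one week."
    challenger, reasoner = Role(), Role()
    args = (toy_make_tasks, toy_attempt, toy_judge, toy_synthesize_skill)
    first = self_play_round(context, challenger, reasoner, *args)
    second = self_play_round(context, challenger, reasoner, *args)
    return first, second, challenger, reasoner
```

In the toy run, the first round fails every task and grows the Reasoner's skill list; the second round passes every task and grows the Challenger's instead. A real Challenger would use its new skills to produce harder tasks, which the fixed `toy_make_tasks` does not do.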
This matters because it dissolves the two bottlenecks that block automated skill construction: the prohibitive cost of manually annotating skills for long, dense contexts, and the absence of external feedback to tell automated construction what to improve. Self-play manufactures the missing feedback — the Challenger's escalating difficulty is the curriculum, and the Judge's binary verdict is the reward — so the system bootstraps a skill set for an arbitrary context from nothing but the context itself.
The counterpoint, which the paper takes seriously, is adversarial collapse: a Challenger free to maximize difficulty drifts toward extreme tasks, and a Reasoner chasing them accumulates over-specialized skills that no longer generalize. Self-play that only ratchets pressure destroys itself. This is why Ctx2Skill needs a separate replay mechanism to anchor generality, and the balance between adversarial pressure and that anchor is the tension worth tracking. The insight is therefore real but conditional: unsupervised co-evolution of language skills works only when adversarial pressure is balanced against a generalization safeguard.
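This note does not say how the replay mechanism works, so the following is only one plausible reading, labeled as an assumption throughout: keep a buffer of earlier tasks and accept a skill update only if the pass rate on that buffer does not drop. The names `pass_rate`, `accept_update`, `tolerance`, and the `solves` callable are all hypothetical.

```python
from typing import Callable, List, Sequence


def pass_rate(
    skills: Sequence[str],
    replay_tasks: Sequence[str],
    solves: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of replayed tasks that the given skill set still solves."""
    if not replay_tasks:
        return 1.0  # nothing to regress on yet
    solved = sum(1 for task in replay_tasks if solves(task, skills))
    return solved / len(replay_tasks)


def accept_update(
    current: Sequence[str],
    candidate: Sequence[str],
    replay_tasks: Sequence[str],
    solves: Callable[[str, Sequence[str]], bool],
    tolerance: float = 0.0,
) -> List[str]:
    """ASSUMED gate: keep the candidate only if replay performance holds."""
    before = pass_rate(current, replay_tasks, solves)
    after = pass_rate(candidate, replay_tasks, solves)
    return list(candidate) if after + tolerance >= before else list(current)


# Toy check: a task counts as solved when a skill with the same name exists.
def toy_solves(task: str, skills: Sequence[str]) -> bool:
    return task in skills


replay = ["refund", "shipping"]
current_skills = ["refund", "shipping"]
over_specialized = ["refund", "edge-case"]        # dropped a general skill
safe_extension = ["refund", "shipping", "edge-case"]

kept = accept_update(current_skills, over_specialized, replay, toy_solves)
grown = accept_update(current_skills, safe_extension, replay, toy_solves)
```

Here `kept` falls back to the current skills because the over-specialized candidate no longer solves a replayed task, while `grown` adopts the extension that adds a skill without losing old ones.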
Inquiring lines that use this note as a source 49
This note is a source for these synthesized inquiries.
- What happens when models train on AI-generated content recursively?
- Why do Generation-Then-Comprehension and AI Delegation produce opposite learning outcomes?
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- Why does online RL succeed where supervised training fails for self-correction?
- What role does natural language play in breaking reinforcement learning performance plateaus?
- Can prompting inject new knowledge into already-trained AI models?
- Why does asymmetric self-play create naturally calibrated difficulty better than fixed curricula?
- Can synthetic self-play data teach models when to disagree?
- Why does natural language feedback break performance plateaus that numerical rewards alone cannot?
- How does hidden processing in language models prevent accurate self-assessment?
- Can models learn to generate their own training examples effectively?
- How do graduated phase rewards elicit complex dialogue behavior from simple objectives?
- Can humans learn accurate models of AI through repeated interaction without labels?
- Can subjective tasks be delegated without human feedback loops?
- What causes gradient-based steering via natural language descriptions to work?
- Can targeted post-training teach AI systems to form ad-hoc linguistic conventions?
- Can unsupervised confidence-based training scale to domains beyond human evaluation reach?
- Can language models develop genuine social grounding through human interaction?
- Can textual gradients generalize natural language feedback across computation graphs?
- Can self-supervised methods replace human annotations for process reward models?
- What alternatives exist when required knowledge is absent from training?
- Can self-supervised process models replace human annotations at scale?
- Can structured natural language feedback outperform scalar rewards in RL?
- How do instruction backtranslation and MAGPIE demonstrate self-generation principles?
- Does self-supervised process supervision work for domains with ambiguous correctness?
- What role does self-learning play in improving agent reasoning without annotation?
- Can language models generate plausible latent thoughts without human annotation?
- Can AI learn intrinsic motivation to assess its own relevance?
- Can bilevel autoresearch autonomously modify its own learning algorithms?
- Can trajectory structure alone provide process supervision without human annotation?
- How tight should a textual learning rate be before it prevents skill escape?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- Can binary judge feedback replace external reward signals for skill learning?
- How does a challenger's escalating difficulty function as curriculum?
- Does self-play feedback improve skills created from the agent's own experience?
- Where does skill extraction fail compared to genuine model adaptation?
- What makes some contexts learnable as rules versus requiring model retraining?
- Can pragmatic competence emerge from text exposure alone without interactive grounding?
- Can metacognitive categories be learned instead of fixed by human designers?
- Do models spontaneously develop self-reflection from minimal training signals?
- Can energy-based transformers achieve deep reasoning without supervision?
- Can verifier-free RL work without manual preference labels or task-specific training?
- Can AI systems improve themselves without external feedback?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- What makes self-consistency a sufficient training target for the judge role?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- How can language models extract more value from fewer demonstrations?
- Can models generate their own training curriculum during offline dreaming?
- Do models naturally learn to ask clarifying questions without explicit supervision?
Related concepts in this collection 3
- Can language models improve themselves without any external training data?
  Explores whether two language models playing against each other (one generating questions, one solving them) can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
  same proposer-vs-solver self-play structure; Ctx2Skill adds a third neutral Judge role and evolves natural-language skills rather than model weights
- Does creating skills inside the agent loop eliminate mismatches?
  Can coupling skill creation directly to the runtime reasoning loop, rather than authoring skills offline, close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
  both ground skill creation in the agent's own experience; Ctx2Skill manufactures the missing feedback via self-play where MUSE manufactures it via in-loop invocation
- Can skill documents be optimized like neural network weights?
  Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
  shares the generalization-collapse risk: Ctx2Skill needs a replay safeguard against adversarial drift much as SkillOpt needs a held-out gate to prevent overfitting self-edits
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- From Context to Skills: Can Language Models Learn from Context Skillfully?
- SPICE: Self-Play In Corpus Environments Improves Reasoning
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- Self-Questioning Language Models
- Training Language Models to Self-Correct via Reinforcement Learning
- Self-Rewarding Language Models
- PretrainZero: Reinforcement Active Pretraining
Original note title
challenger-reasoner-judge self-play can co-evolve natural-language skills with no human supervision