Can language models learn skills without human supervision?

Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?

Synthesis note · 2026-05-28 · sourced from Context Engineering

Ctx2Skill constructs skills without human annotation or an external reward signal by running a three-role self-play loop. A Challenger generates probing tasks and rubrics against a context; a Reasoner attempts them guided by its current skill set; a neutral Judge issues binary pass/fail feedback. The signal is internal: easily solved tasks are routed back to strengthen the Challenger, while failed cases are routed to Proposer and Generator agents that synthesize targeted skill updates for the Reasoner. Both sides evolve through accumulated natural-language skills rather than parameter updates.
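The routing described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: every name here (`Task`, `Role`, `self_play_round`, and the `toy_*` functions) is hypothetical, the model calls are replaced by plain Python callables, and the Proposer and Generator agents are collapsed into a single `repair` callable for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    prompt: str
    rubric: str  # pass criterion written by the Challenger


@dataclass
class Role:
    # Skills are plain natural-language strings; no model weights change.
    skills: List[str] = field(default_factory=list)


def self_play_round(
    context: str,
    challenger: Role,
    reasoner: Role,
    generate_tasks: Callable[[str, List[str]], List[Task]],
    attempt: Callable[[str, Task, List[str]], str],
    judge: Callable[[Task, str], bool],
    harden: Callable[[Task], str],
    repair: Callable[[Task, str], str],
) -> List[bool]:
    """Run one round and route each binary verdict to the side it should improve."""
    verdicts = []
    for task in generate_tasks(context, challenger.skills):
        answer = attempt(context, task, reasoner.skills)
        passed = judge(task, answer)
        verdicts.append(passed)
        if passed:
            # Solved task: the Challenger gains a skill for posing harder tasks.
            challenger.skills.append(harden(task))
        else:
            # Failed task: the Reasoner gains a targeted skill update.
            reasoner.skills.append(repair(task, answer))
    return verdicts


# Toy stand-ins for the model calls (assumptions for illustration only).
def toy_generate(context: str, skills: List[str]) -> List[Task]:
    return [
        Task("At what pressure does the valve open?", "40 psi"),
        Task("What closes the valve?", "spring"),
    ]


def toy_attempt(context: str, task: Task, skills: List[str]) -> str:
    return context  # a naive Reasoner that just echoes the context


def toy_judge(task: Task, answer: str) -> bool:
    return task.rubric in answer  # binary pass/fail


def toy_harden(task: Task) -> str:
    return f"Pose harder variants of: {task.prompt}"


def toy_repair(task: Task, answer: str) -> str:
    return f"When asked {task.prompt!r}, look for: {task.rubric}"


challenger, reasoner = Role(), Role()
verdicts = self_play_round(
    "The valve opens at 40 psi.",
    challenger, reasoner,
    toy_generate, toy_attempt, toy_judge, toy_harden, toy_repair,
)
print(verdicts)  # [True, False]
```

In the toy run, the first task passes and adds one skill to the Challenger, while the second fails and adds one skill to the Reasoner. A real system would replace each callable with a prompted model that reads the accumulated skills.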

This matters because it dissolves the two bottlenecks that block automated skill construction: the prohibitive cost of manually annotating skills for long, dense contexts, and the absence of external feedback to tell automated construction what to improve. Self-play manufactures the missing feedback — the Challenger's escalating difficulty is the curriculum, and the Judge's binary verdict is the reward — so the system bootstraps a skill set for an arbitrary context from nothing but the context itself.

The counterpoint, which the paper takes seriously, is adversarial collapse: a Challenger free to maximize difficulty drifts toward extreme tasks, and a Reasoner chasing them accumulates over-specialized skills that no longer generalize. Self-play that only ratchets up pressure destroys itself, which is why Ctx2Skill needs a separate replay mechanism to anchor generality. That tension is the one worth tracking. The insight is therefore real but conditional: unsupervised co-evolution of language skills works only when adversarial pressure is balanced against a generalization safeguard.
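One way to picture a generalization safeguard is a replay gate. The note does not describe how Ctx2Skill's replay mechanism works, so the sketch below is an assumption: it keeps a buffer of earlier tasks and accepts a candidate skill set only if its pass rate on that buffer does not drop. All names (`replay_pass_rate`, `accept_update`, `toy_solve`) are hypothetical.

```python
from typing import Callable, List, Sequence


def replay_pass_rate(
    skills: Sequence[str],
    replay_tasks: Sequence[str],
    solve: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of replayed tasks solved with the given skill set."""
    if not replay_tasks:
        return 1.0  # nothing to regress on yet
    return sum(solve(task, skills) for task in replay_tasks) / len(replay_tasks)


def accept_update(
    current: Sequence[str],
    candidate: Sequence[str],
    replay_tasks: Sequence[str],
    solve: Callable[[str, Sequence[str]], bool],
    tolerance: float = 0.0,
) -> List[str]:
    """Keep the candidate skills only if replay performance does not regress."""
    before = replay_pass_rate(current, replay_tasks, solve)
    after = replay_pass_rate(candidate, replay_tasks, solve)
    return list(candidate) if after >= before - tolerance else list(current)


# Toy solver (an assumption): a task counts as solved if some skill mentions it.
def toy_solve(task: str, skills: Sequence[str]) -> bool:
    return any(task in skill for skill in skills)


replay = ["units", "dates"]
current = ["handle units", "handle dates"]

# An over-specialized rewrite that forgets earlier skills is rejected.
drifted = accept_update(current, ["handle one extreme edge case"], replay, toy_solve)

# An additive update that still covers the replayed tasks is accepted.
extended = accept_update(current, current + ["handle negation"], replay, toy_solve)
```

Here `drifted` stays equal to the current skill set, while `extended` gains the new skill. The `tolerance` parameter is a design knob for how much regression on old tasks is acceptable in exchange for progress on new ones.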

Related concepts in this collection 3

Can language models improve themselves without any external training data?
Explores whether two language models playing against each other (one generating questions, one solving them) can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
same proposer-vs-solver self-play structure; Ctx2Skill adds a third neutral Judge role and evolves natural-language skills rather than model weights

Does creating skills inside the agent loop eliminate mismatches?
Can coupling skill creation directly to the runtime reasoning loop (rather than authoring skills offline) close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
both ground skill creation in the agent's own experience; Ctx2Skill manufactures the missing feedback via self-play where MUSE manufactures it via in-loop invocation

Can skill documents be optimized like neural network weights?
Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
shares the generalization-collapse risk: Ctx2Skill needs a replay safeguard against adversarial drift much as SkillOpt needs a held-out gate to prevent overfitting self-edits
