SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Agentic Systems and Tool Use · Model Architecture and Internals

Can language models learn skills without human supervision?

Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?

Synthesis note · 2026-05-28 · sourced from Context Engineering

Ctx2Skill constructs skills without human annotation or an external reward signal by running a three-role self-play loop. A Challenger generates probing tasks and rubrics against a context; a Reasoner attempts them guided by its current skill set; a neutral Judge issues binary pass/fail feedback. The signal is internal: easily solved tasks are routed back to strengthen the Challenger, while failed cases are routed to Proposer and Generator agents that synthesize targeted skill updates for the Reasoner. Both sides evolve through accumulated natural-language skills rather than parameter updates.
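
The routing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Ctx2Skill's implementation: the names `Task`, `Agent`, `self_play_round`, and every callable passed in (`make_tasks`, `attempt`, `judge`, `propose_skill`, `harden`) are hypothetical stand-ins for LLM calls, and the Proposer and Generator agents are collapsed into the single `propose_skill` callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    prompt: str
    rubric: str


@dataclass
class Agent:
    # Skills are plain natural-language strings; no parameters are updated.
    skills: List[str] = field(default_factory=list)


def self_play_round(
    context: str,
    challenger: Agent,
    reasoner: Agent,
    make_tasks: Callable[[str, List[str]], List[Task]],
    attempt: Callable[[str, Task, List[str]], str],
    judge: Callable[[Task, str], bool],
    propose_skill: Callable[[Task, str], str],
    harden: Callable[[Task], str],
) -> None:
    """Run one round: pose tasks, attempt them, route each verdict."""
    for task in make_tasks(context, challenger.skills):
        answer = attempt(context, task, reasoner.skills)
        if judge(task, answer):
            # Easily solved task: feed it back to strengthen the Challenger.
            challenger.skills.append(harden(task))
        else:
            # Failed task: synthesize a targeted skill for the Reasoner.
            reasoner.skills.append(propose_skill(task, answer))


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_make_tasks(context: str, skills: List[str]) -> List[Task]:
    return [Task("easy", "must say ok"), Task("hard", "must say ok")]


def toy_attempt(context: str, task: Task, skills: List[str]) -> str:
    known = task.prompt == "easy" or any(task.prompt in s for s in skills)
    return "ok" if known else "unsure"


def toy_judge(task: Task, answer: str) -> bool:
    return answer == "ok"  # binary pass/fail


def toy_propose(task: Task, answer: str) -> str:
    return f"how to handle: {task.prompt}"


def toy_harden(task: Task) -> str:
    return f"ask harder variants of: {task.prompt}"


challenger, reasoner = Agent(), Agent()
self_play_round("some context", challenger, reasoner, toy_make_tasks,
                toy_attempt, toy_judge, toy_propose, toy_harden)
```

After one round with these toy stand-ins, the Challenger holds one hardening skill (from the solved task) and the Reasoner holds one new skill (from the failed task). Both agents change only by appending text to their skill lists.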

This matters because it removes the two bottlenecks that block automated skill construction: the prohibitive cost of manually annotating skills for long, dense contexts, and the absence of external feedback to tell automated construction what to improve. Self-play manufactures the missing feedback: the Challenger's escalating difficulty is the curriculum, and the Judge's binary verdict is the reward. The system can therefore bootstrap a skill set for an arbitrary context from nothing but the context itself.

The counterpoint, which the paper takes seriously, is adversarial collapse: a Challenger free to maximize difficulty drifts toward extreme tasks, and a Reasoner chasing them accumulates over-specialized skills that no longer generalize. Self-play that only ratchets up pressure destroys itself. This is why Ctx2Skill needs a separate replay mechanism to anchor generality, and the balance between the two is the tension worth tracking. The insight is therefore real but conditional: unsupervised co-evolution of language skills works only when adversarial pressure is balanced against a generalization safeguard.
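
One way to picture a generalization safeguard is a replay gate: a candidate skill is kept only if the pass rate on a set of earlier tasks does not drop. The note does not describe how Ctx2Skill's replay mechanism actually works, so everything in this sketch is an assumption, including the names `pass_rate` and `accept_update`, the `solve` callable, and the `tolerance` parameter.

```python
from typing import Callable, List


def pass_rate(skills: List[str], tasks: List[str],
              solve: Callable[[str, List[str]], bool]) -> float:
    """Fraction of replayed tasks solved with the given skill set."""
    if not tasks:
        return 1.0
    return sum(solve(t, skills) for t in tasks) / len(tasks)


def accept_update(skills: List[str], candidate: str, replay_tasks: List[str],
                  solve: Callable[[str, List[str]], bool],
                  tolerance: float = 0.0) -> List[str]:
    """Return the updated skill set, or the old one if replay performance drops.

    Hypothetical gate: this is not the paper's mechanism, only an illustration
    of balancing new skills against performance on earlier tasks.
    """
    before = pass_rate(skills, replay_tasks, solve)
    updated = skills + [candidate]
    after = pass_rate(updated, replay_tasks, solve)
    return updated if after >= before - tolerance else skills


# Toy solver (assumption): an over-specialized skill breaks earlier tasks.
def toy_solve(task: str, skills: List[str]) -> bool:
    return "general" in skills and "narrow-hack" not in skills


base = ["general"]
replay = ["task-a", "task-b"]
rejected = accept_update(base, "narrow-hack", replay, toy_solve)
accepted = accept_update(base, "helper", replay, toy_solve)
```

In this toy setup the over-specialized candidate is rejected because it lowers the replay pass rate, while the harmless candidate is accepted.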

Original note title

challenger-reasoner-judge self-play can co-evolve natural-language skills with no human supervision