INQUIRING LINE

How do self-evolving curricula help RL break beyond base model capability boundaries?

This explores whether a curriculum that generates its own progressively harder tasks can push RL past the ceiling of what the base model already knew — and the corpus is openly split on whether that ceiling can be broken at all.


The first thing to know is that the corpus disagrees about whether base-model capability boundaries can be broken at all. One camp argues RL mostly *redeploys* what's already latent: pass@k analysis shows base models matching or beating RLVR-trained models at high k, suggesting RL narrows sampling toward solutions already in the base distribution rather than adding new ones [Does RLVR actually expand what models can reason about?]. Related work frames verifiable rewards as catalysts that surface pretrained strategies rather than teachers of new ones [How does RL training reshape reasoning and what gets lost?], with RL teaching *when* to reason, not *how* [Does RL post-training create reasoning or just deploy it?]. If that's the whole story, no curriculum can break a boundary, because RL would have nothing outside the base distribution to reach.
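To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are invented for illustration and do not come from any cited paper; the source does not say which estimator those papers use, and `benchmark_pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Mean pass@k over problems (hypothetical helper for this sketch)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Invented data: correct attempts out of n = 256 samples on four problems.
N = 256
base_counts = [3, 1, 2, 4]    # base model: rarely right, but right somewhere on every problem
rl_counts = [200, 180, 0, 0]  # RL-tuned model: reliable on two problems, never right on two

for k in (1, 256):
    base = benchmark_pass_at_k(base_counts, N, k)
    tuned = benchmark_pass_at_k(rl_counts, N, k)
    print(f"k={k}: base={base:.3f} rl={tuned:.3f}")
```

With these made-up counts the RL-tuned model wins at k=1 while the base model wins at k=256, which is the shape of the "narrowed sampling" result: better first-try reliability without a larger set of solvable problems.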

But the opposing result is exactly where curricula earn their keep. Prolonged RL on *diverse, non-mathematical* tasks, with KL control and policy resetting, produces models that beat the base across all pass@k levels, which the authors read as genuine boundary expansion, not just sampling efficiency [Can reinforcement learning discover reasoning strategies base models cannot?]. The operative words are 'prolonged' and 'diverse.' A static reward on a fixed task distribution collapses fast: RL converges on a single dominant pretraining format within the first epoch, suppressing alternatives [Does RL training collapse format diversity in pretrained models?]. A self-evolving curriculum is the mechanism that keeps the task distribution moving faster than the model can collapse onto it: it keeps feeding the model problems just past its current frontier, so there's always something the existing distribution can't already solve.
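One way to picture "problems just past the current frontier" is a sampler that favors difficulty levels the policy solves about half the time. The sketch below is an illustrative heuristic, not the method of any cited paper: the class name `FrontierCurriculum`, the p(1 - p) weighting, the moving-average update, and every parameter value are assumptions.

```python
import random


class FrontierCurriculum:
    """Hypothetical sketch: sample difficulty levels near a 50% success rate."""

    def __init__(self, num_levels, momentum=0.9, floor=0.01, seed=0):
        self.rates = [0.5] * num_levels  # assumed prior: every level starts uncertain
        self.momentum = momentum         # smoothing for the running success rate
        self.floor = floor               # keeps every level sampled occasionally
        self.rng = random.Random(seed)

    def weights(self):
        # p * (1 - p) peaks at p = 0.5, i.e. tasks neither mastered nor hopeless.
        return [p * (1.0 - p) + self.floor for p in self.rates]

    def sample_level(self):
        levels = list(range(len(self.rates)))
        return self.rng.choices(levels, weights=self.weights(), k=1)[0]

    def update(self, level, solved):
        # Exponential moving average of observed success on this level.
        m = self.momentum
        self.rates[level] = m * self.rates[level] + (1.0 - m) * float(solved)


curriculum = FrontierCurriculum(num_levels=3)
curriculum.rates = [0.98, 0.5, 0.02]  # mastered, frontier, currently out of reach
frontier_weights = curriculum.weights()
chosen = curriculum.sample_level()
curriculum.update(1, solved=True)
```

As the policy improves on a level its weight falls, so sampling mass drifts toward harder levels without anyone hand-writing a schedule.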

The deeper reason curricula matter is that *self-improvement alone is provably bounded*. Pure self-improvement stalls on the generation–verification gap, diversity collapse, and reward hacking; every method that actually works smuggles in an external anchor, whether a past model version, a third-party judge, a user correction, or tool feedback [Can models reliably improve themselves without external feedback?]. That limit holds formally, not just empirically [What stops large language models from improving themselves?]. A self-evolving curriculum is one way to manufacture that external signal continuously: the environment itself becomes the judge. This is why VOYAGER's automatic curriculum works: it pairs an externalized, composable skill library with environmental feedback, so the agent keeps exploring and refining skills while avoiding the forgetting that comes with pure weight updates [Can agents learn new skills without forgetting old ones?].

The thing you might not have expected: the binding constraint is often the *curriculum's imagination*, not the model's capacity. Agents trained on static expert demonstrations are capped by what the curators imagined and can't learn from their own failures because they never interact with an environment [Can agents learn beyond what their training data shows?]. A self-evolving curriculum's whole value is that it removes the human imagination ceiling: it generates tasks the curator never thought to write down. And it has to do so carefully. Training order mechanically reshapes entropy, with structured tasks draining output entropy while open-ended ones raise it, so scheduling structured tasks first yields a 6.2% gain over joint training and protects creative capability from collapse [Does training order reshape how models handle different task types?]. That this scales is no longer hypothetical: RL now works in long-horizon, multi-turn settings with delayed rewards, roughly doubling SWE-bench Verified performance from 20% to 39% [Can reinforcement learning scale beyond single-turn language tasks?], exactly the stateful environments where an evolving curriculum has room to run.
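The entropy claim can be made concrete with a small sketch: Shannon entropy of an output distribution, plus a structured-first ordering of tasks. The distributions, task names, and the `structured` flag are invented for illustration, and the plain sort below is a stand-in assumption for the BWT-guided scheduling the source describes, not a reproduction of it.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of a (possibly unnormalized) distribution."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def structured_first(tasks):
    """Order tasks so structured ones come before open-ended ones (stable sort)."""
    return sorted(tasks, key=lambda task: not task["structured"])


# Invented next-token distributions over four candidate tokens.
peaked = [0.97, 0.01, 0.01, 0.01]  # what a structured task tends to push toward
spread = [0.25, 0.25, 0.25, 0.25]  # what an open-ended task tends to preserve

# Invented task list; the "structured" flag is a hypothetical label.
tasks = [
    {"name": "story_writing", "structured": False},
    {"name": "math", "structured": True},
    {"name": "dialogue", "structured": False},
    {"name": "code", "structured": True},
]
schedule = [task["name"] for task in structured_first(tasks)]

print(f"peaked entropy: {shannon_entropy(peaked):.3f} nats")
print(f"spread entropy: {shannon_entropy(spread):.3f} nats")
print("schedule:", schedule)
```

A policy whose distributions look like `peaked` has little variation left to explore, which is why the order in which entropy-draining and entropy-raising tasks arrive can matter.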

So the honest synthesis: a self-evolving curriculum doesn't magically add capability a model could never represent. What it does is keep RL from collapsing onto the base distribution by continuously supplying frontier-pushing tasks and an external grading signal — the two ingredients the 'RL only redeploys' results were missing and the boundary-expansion results happened to include.


Sources (11 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether self-evolving curricula genuinely help RL break base model capability boundaries, or merely redeploy latent capacity. Treat the following as dated claims (2024–2026), not current truth.

What a curated library found — and when (findings span Sept 2024–Apr 2026):
• Base models often match or beat RL-trained models at high pass@k, suggesting RL narrows sampling toward pre-existing solutions rather than discovering new ones (2025-04, arXiv:2504.13837).
• Prolonged RL on diverse, non-mathematical tasks with KL control and policy resetting *does* beat base models across all pass@k levels, interpreted as genuine capability expansion, not efficiency gain (2025-05, arXiv:2505.24864).
• Static reward on fixed task distributions causes RL to collapse onto a single dominant pretraining format within one epoch, suppressing alternatives (2025-04, arXiv:2504.07912).
• Self-improvement alone is formally bounded by generation–verification gap, diversity collapse, and reward hacking; every working method imports external anchors—past models, third-party judges, user corrections, tool feedback (2024-12, arXiv:2412.02674).
• Self-evolving curricula escape this via environment-as-judge; task scheduling (structured-first) mechanically reshapes entropy and measurably protects creative capability (2025-07, arXiv:2507.14783).
• RL now scales to long-horizon, multi-turn SWE tasks with delayed rewards, doubling SWE-bench performance—the stateful regime where curricula have room to operate (2025-08, arXiv:2508.03501).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-04): "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"
• arXiv:2505.24864 (2025-05): "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models"
• arXiv:2507.14783 (2025-07): "Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling"
• arXiv:2508.03501 (2025-08): "Training Long-Context, Multi-Turn Software Engineering Agents with RL"

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 'RL only redeploys' finding: has newer work (last 6 months) on mechanistic interpretability, model probing, or emergent capability detection shown that pass@k equivalence at high k actually masks hidden capability expansion? Test the 'prolonged + diverse' pathway: do newer curricula beat ProRL's results, or does the diversity ceiling hold? Surface plainly where constraints still appear to bind.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Focus on tension: do any recent papers undermine the 'external anchor' necessity claim, or show pure self-improvement evading the formal bound?
(3) Propose 2 research questions that ASSUME the regime has moved: (a) If curricula do break boundaries, what is the mechanistic signature that distinguishes novel reasoning from redeployed reasoning? (b) What curriculum design minimizes the task-distribution imagination ceiling without an external oracle?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
