INQUIRING LINE

Can curriculum approaches teach agents when to stop exploring?

This explores whether curriculum learning — staging tasks from easy to hard, or stretching how long an agent gets to act — can teach agents the judgment of when enough exploration is enough, rather than just how to explore more.


The corpus has no paper that names this exact problem, but it has the pieces that frame it, and they pull in two opposing directions worth seeing side by side.

The strongest direct evidence that curriculum shapes exploration comes from test-time interaction scaling: training agents with a curriculum that grows the number of environment steps they are allowed to take produces state-of-the-art web agents, because longer rollouts let them explore, backtrack, and replan in ways that per-step reasoning can't (see "Does agent interaction time scale separately from reasoning depth?"). Here the curriculum teaches the agent how to *spend* an exploration budget and, implicitly, where to stop spending it. VOYAGER shows a second flavor: an *automatic* curriculum that keeps proposing new goals drives continual exploration, while a skill library lets the agent bank what it learns instead of re-deriving it (see "Can agents learn new skills without forgetting old ones?"). In both cases the curriculum regulates the explore/exploit rhythm.
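To make the idea of a curriculum over interaction length concrete, here is a minimal sketch of a step-budget schedule. The function name `horizon_schedule`, the multiplicative growth rule, and every default value are illustrative assumptions, not the schedule used in the cited work.

```python
def horizon_schedule(stage: int, start: int = 5, growth: float = 2.0, cap: int = 40) -> int:
    """Return the per-episode step budget for a curriculum stage.

    Hypothetical multiplicative schedule: the budget starts small,
    grows by `growth` each stage, and is clipped at `cap`.
    """
    if stage < 0:
        raise ValueError("stage must be non-negative")
    return min(cap, int(start * growth ** stage))


# Budgets for the first five curriculum stages.
budgets = [horizon_schedule(s) for s in range(5)]
print(budgets)  # [5, 10, 20, 40, 40]
```

Under this kind of schedule the agent first learns to finish tasks within a tight budget and only later gets room to backtrack and replan, which is one plausible way a curriculum could shape where exploration stops.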

But here's the twist the corpus surfaces: the more you optimize an agent with reinforcement learning, the *worse* its sense of when to keep exploring gets. RL training compresses behavioral diversity in search agents through entropy collapse: policies converge onto a few narrow reward-maximizing moves and stop probing alternatives (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So a naive curriculum that just rewards success can teach an agent to stop exploring *too early*, locking it into a comfortable strategy. The fix that note points to, supervised fine-tuning on diverse demonstrations to preserve exploration breadth, is itself a kind of curriculum design choice.
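One simple way to watch for this kind of collapse is to track the Shannon entropy of the actions an agent takes across rollouts. The sketch below is an illustrative monitor, not the measurement used in the cited note; the function name `action_entropy` and the action labels are hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in nats) of the empirical action distribution."""
    if not actions:
        raise ValueError("need at least one action")
    total = len(actions)
    counts = Counter(actions)
    # Sum -p * log(p) over the observed action frequencies.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


diverse = ["search", "click", "back", "scroll"]        # four distinct moves
collapsed = ["search", "search", "search", "search"]   # one move repeated

print(action_entropy(diverse))    # ln(4), about 1.386
print(action_entropy(collapsed))  # 0.0
```

A value that falls steadily toward zero during training would be a warning sign that the policy is narrowing onto a few moves.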

The opposite failure is just as real: exploring forever along one path. Abstractions that enforce breadth-first exploration outperform pouring all your compute into deeper and deeper single chains, precisely because depth-only reasoning runs into an 'underthinking' failure where the agent keeps going without ever stepping back (see "Can abstractions guide exploration better than depth alone?"). Read against the entropy-collapse note, you get the real shape of 'when to stop': it's a balance between collapsing too soon and drilling too long, and the structure you train against (abstractions, interaction budgets, diversity-preserving data) is what tunes that balance.

One boundary worth knowing: curriculum can only teach this if the agent actually gets to act and fail. Agents trained purely on static expert demonstrations never interact with an environment, so their competence, including any sense of when to quit exploring, is capped by whatever scenarios the curators imagined rather than learned from experience (see "Can agents learn beyond what their training data shows?"). So the honest answer is: yes, curriculum approaches can shape *when* an agent stops exploring, but only the interactive, reward-shaped kind, and only if they're explicitly designed to fight entropy collapse on one side and runaway depth on the other.


Sources (5 notes)

Does agent interaction time scale separately from reasoning depth?

Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question is: *Can curriculum approaches teach agents when to stop exploring?* — still open, especially as interactive agent architectures evolve.

What a curated library found — and when (dated claims, not current truth):
Findings span May 2025–May 2026. A library of agent research surfaces:
• Test-time interaction scaling (growing rollout budgets via curriculum) produces state-of-the-art web agents; longer exploration windows let agents backtrack and replan, implicitly teaching stopping points (2025-06).
• RL training on agents compresses behavioral diversity through entropy collapse — policies lock into narrow reward-maximizing moves and stop probing alternatives; diversity-preserving curricula (e.g., via demonstrations) counter this (2025-07, 2026-05).
• Breadth-first exploration abstractions outperform depth-only reasoning; depth-only chains lead to 'underthinking' failures where agents keep going without stepping back (2025-05).
• Static expert demonstration curricula cap agent competence to curator-imagined scenarios; no interactive failure teaches the stopping judgment (2025-10).
• Skill libraries and continuous goal refinement (e.g., VOYAGER-style) let agents bank learned skills, regulating the explore/exploit rhythm (2026-04, 2026-05).

Anchor papers (verify; mind their dates):
• arXiv:2506.07976 (Thinking vs. Doing: Test-Time Interaction Scaling, 2025-06)
• arXiv:2605.22817 (Vector Policy Optimization: Training for Diversity, 2026-05)
• arXiv:2510.08558 (Agent Learning via Early Experience, 2025-10)
• arXiv:2604.08377 (SkillClaw: Collective Skill Evolution, 2026-04)

Your task:
(1) RE-TEST EACH CONSTRAINT. For entropy collapse: check whether recent multi-agent orchestration, mixture-of-experts routing, or intrinsic motivation architectures have since RELAXED the diversity-loss problem. For interaction scaling: test whether newer harnesses (e.g., 2026-05 SDK work) or caching strategies change the cost/benefit calculus of longer rollouts. For static demonstrations: probe whether in-context learning or retrieval-augmented agent design now allows demonstration sets to adapt without retraining. Flag what still holds and what has shifted.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (Q1–Q2 2026). Look for papers that either dissolve the stopping-judgment problem through architectural novelty (e.g., learned termination critics, adaptive budgeting) or reveal it as a false dichotomy.

(3) Propose 2 research questions that ASSUME the regime may have moved:
   – Can learned termination critics (trained on held-out exploration tasks) generalize the stopping judgment to new domains without curriculum redesign?
   – Does externalization (memory, skill artifacts, protocol codification) reduce the agent's need for fine-grained curriculum tuning, by outsourcing the explore/exploit rhythm to the harness itself?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
