Why do RL agents exploit before exploring enough?
Standard task-oriented RL rewards immediate task completion over environment discovery. This may systematically under-train the exploration skills needed for unfamiliar environments.
LLM agents often fail in unfamiliar environments through premature exploitation: acting on prior knowledge before acquiring enough environment-specific information. The paper's diagnosis is that this is baked in by training. Standard reinforcement learning with verifiable rewards (RLVR) optimizes task-completion rewards over known or static distributions, which encourages instrumental behavior aimed at solving predefined tasks and provides little incentive to develop the autonomous exploration needed for novel environments. The result is narrow, repetitive behavior that impedes downstream performance.
The fix treats exploration as a first-class, trainable objective rather than a byproduct of task reward. Exploration Checkpoint Coverage (ECC) is a verifiable metric for how broadly an agent discovers key states, objects, and affordances. Training interleaves task-execution rollouts and exploration rollouts, each optimized by its own verifiable reward, yielding the Explore-then-Act paradigm: the agent first spends an interaction budget building grounded environmental knowledge, then leverages it for the task.
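The note does not give a formula for ECC, so the following is only a minimal sketch under one plausible reading: coverage as the fraction of a predefined checkpoint set that the agent reached during an exploration rollout. The function name, the checkpoint labels, and the toy trajectories are all hypothetical and chosen for illustration.

```python
from typing import Iterable, Set


def exploration_checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of predefined checkpoints the agent discovered.

    Assumed definition for illustration; the source only says ECC measures
    how broadly key states, objects, and affordances are discovered.
    """
    if not checkpoints:
        return 0.0
    discovered = set(visited) & checkpoints  # repeats and off-list events do not count
    return len(discovered) / len(checkpoints)


# Hypothetical checkpoints for a toy household environment.
checkpoints = {"open_drawer", "find_key", "inspect_fridge", "locate_sink"}

# A narrow, repetitive trajectory: three steps, one distinct checkpoint.
narrow = ["open_drawer", "open_drawer", "open_drawer"]
# A broader trajectory with the same interaction budget: three distinct checkpoints.
broad = ["open_drawer", "find_key", "locate_sink"]

narrow_score = exploration_checkpoint_coverage(narrow, checkpoints)
broad_score = exploration_checkpoint_coverage(broad, checkpoints)
print(narrow_score, broad_score)
```

Because the score depends only on which checkpoints were reached, it can be checked automatically, which is what makes it usable as a verifiable reward in this sketch.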
The deeper claim is that information-gathering and task execution are different competencies that must be incentivized separately: collapsing them into one outcome reward systematically under-trains the first. This is the acting-side analog of "Can confidence trajectories reveal when reasoning goes wrong?": both notes identify a "commit too early" failure and fix it by rewarding the process (gradual confidence there, systematic coverage here) rather than only the outcome. It also extends "Why do RL agents stop asking informative questions?" by giving the escape from the self-locking trap an explicit training objective.
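To make the "separate incentives" point concrete, here is a small sketch comparing a single outcome reward with a reward routed by rollout type. It assumes exploration rollouts are scored by a coverage fraction and task rollouts by a binary success signal; the `Rollout` class, the `kind` labels, and the checkpoint names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Rollout:
    kind: str                 # "task" or "explore" (hypothetical labels)
    events: List[str]         # what the agent reached during the rollout
    task_solved: bool = False


def outcome_only_reward(rollout: Rollout) -> float:
    """Single outcome reward: 1 if the task was solved, else 0."""
    return 1.0 if rollout.task_solved else 0.0


def separate_reward(rollout: Rollout, checkpoints: Set[str]) -> float:
    """Route each rollout type to its own verifiable reward (assumed scheme)."""
    if rollout.kind == "explore":
        if not checkpoints:
            return 0.0
        return len(set(rollout.events) & checkpoints) / len(checkpoints)
    return 1.0 if rollout.task_solved else 0.0


checkpoints = {"open_drawer", "find_key", "inspect_fridge", "locate_sink"}

# Two exploration rollouts that differ in breadth; neither solves a task.
explore_rollouts = [
    Rollout("explore", ["open_drawer", "open_drawer", "open_drawer"]),
    Rollout("explore", ["open_drawer", "find_key", "locate_sink"]),
]

outcome_rewards = [outcome_only_reward(r) for r in explore_rollouts]
separate_rewards = [separate_reward(r, checkpoints) for r in explore_rollouts]
task_reward = separate_reward(Rollout("task", ["find_key"], task_solved=True), checkpoints)
print(outcome_rewards, separate_rewards, task_reward)
```

In this toy setup the outcome-only reward assigns the same value to both exploration rollouts, so it cannot distinguish broad from narrow exploration, while the routed reward does.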
Related concepts in this collection 3
- Why do RL agents stop asking informative questions?
  RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.
  Relation: ECC-rewarded exploration is an explicit escape from the self-locking trap.
- Can confidence trajectories reveal when reasoning goes wrong?
  Does the timing of when a model commits to an answer predict whether its reasoning will be flawed? And can we use this signal to train better reasoning without expensive annotations?
  Relation: sibling "premature" failure on the reasoning side.
- Can agents learn from their own actions without external rewards?
  Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.
  Relation: both make the agent's own environment interaction a primary training signal.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Look Before You Leap: Autonomous Exploration for LLM Agents
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- Teaching Large Language Models to Reason with Reinforcement Learning
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- Reinforcement Learning with Rubric Anchors
- Agent Learning via Early Experience
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- Self-Questioning Language Models
Original note title
task-oriented RL produces premature exploitation — exploration must be trained as a separate verifiable objective before task execution