SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Agentic Systems and Tool Use

Why do RL agents exploit before exploring enough?

Standard task-oriented RL rewards immediate task completion over environment discovery. This may systematically under-train the exploration skills needed for unfamiliar environments.

Synthesis note · 2026-06-03 · sourced from RLVR

LLM agents often fail in unfamiliar environments through premature exploitation: acting on prior knowledge before acquiring enough environment-specific information. The paper's diagnosis is that this is baked in by training. Standard RLVR optimizes task-completion rewards in known or static distributions, which encourages instrumental behavior aimed at solving predefined tasks and provides little incentive to develop the autonomous exploration needed for novel environments. The result is narrow, repetitive behavior that impedes downstream performance.

The fix treats exploration as a first-class, trainable objective rather than a byproduct of task reward. Exploration Checkpoint Coverage (ECC) is a verifiable metric for how broadly an agent discovers key states, objects, and affordances. Training interleaves task-execution rollouts and exploration rollouts, each optimized by its own verifiable reward, yielding the Explore-then-Act paradigm: the agent first spends an interaction budget building grounded environmental knowledge, then leverages it for the task.
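As a minimal sketch of how a coverage-style reward could sit alongside a task reward: the note does not give a formula for ECC, so this assumes the simplest reading, the fraction of a predefined checkpoint set that a rollout reaches. The names `checkpoint_coverage` and `rollout_reward`, the string checkpoint IDs, and the binary task reward are all hypothetical and not taken from the source.

```python
from typing import Iterable, Set


def checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Assumed ECC-style score: share of predefined checkpoints reached.

    Repeat visits and states outside the checkpoint set do not add credit.
    """
    if not checkpoints:
        return 0.0
    return len(set(visited) & checkpoints) / len(checkpoints)


def rollout_reward(kind: str, visited: Iterable[str],
                   checkpoints: Set[str], task_solved: bool) -> float:
    """Route each rollout type to its own verifiable reward (illustrative)."""
    if kind == "explore":
        # Exploration rollouts are scored on coverage only.
        return checkpoint_coverage(visited, checkpoints)
    if kind == "task":
        # Task rollouts are scored on the outcome only (assumed binary).
        return 1.0 if task_solved else 0.0
    raise ValueError(f"unknown rollout kind: {kind!r}")


# Hypothetical environment with four checkpoints; the agent reaches two.
checkpoints = {"kitchen", "drawer", "key", "door"}
visited = ["kitchen", "hallway", "drawer", "kitchen"]
explore_r = rollout_reward("explore", visited, checkpoints, task_solved=False)
task_r = rollout_reward("task", visited, checkpoints, task_solved=True)
```

Keeping the two rewards in separate branches mirrors the note's point that each rollout type is optimized by its own signal, so an exploration rollout earns credit for coverage even when no task is solved.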

The deeper claim is that information-gathering and task execution are different competencies that must be incentivized separately: collapsing them into one outcome reward systematically under-trains the first. This is the acting-side analog of the note "Can confidence trajectories reveal when reasoning goes wrong?": both identify a "commit too early" failure and fix it by rewarding the process (gradual confidence in one case, systematic coverage in the other) rather than only the outcome. It also extends the note "Why do RL agents stop asking informative questions?" by giving the escape from that failure an explicit training objective.

Original note title

task-oriented RL produces premature exploitation — exploration must be trained as a separate verifiable objective before task execution