On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Paper · arXiv 2603.12109 · Published March 12, 2026
Reasoning Architectures · Reinforcement Learning

Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning, where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent’s belief based on collected evidence. We show that deficient AS and BT capabilities limit information exploration during RL training. Insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking.
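To make the AS/BT decomposition concrete, the following is a minimal toy sketch, not the paper's setup. It assumes a hidden target among a small set of hypotheses, noiseless yes/no questions represented as sets of hypotheses, and a greedy information-gain rule for choosing questions. All names (`update_belief`, `select_action`, the example questions) are hypothetical and chosen for illustration only.

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a belief over hypotheses."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def update_belief(belief, question, answer):
    """Belief Tracking (BT): Bayes update for a noiseless yes/no answer.

    `question` is the set of hypotheses for which the answer would be "yes".
    """
    posterior = {h: p for h, p in belief.items() if (h in question) == answer}
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("answer is inconsistent with the current belief")
    return {h: p / total for h, p in posterior.items()}

def expected_info_gain(belief, question):
    """Expected entropy reduction (in bits) from asking `question`."""
    p_yes = sum(p for h, p in belief.items() if h in question)
    gain = entropy(belief)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(update_belief(belief, question, answer))
    return gain

def select_action(belief, questions):
    """Action Selection (AS): pick the question with the highest expected gain."""
    return max(questions, key=lambda name: expected_info_gain(belief, questions[name]))

# Toy example (assumed, not from the paper): four equally likely hypotheses.
belief = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
questions = {
    "is_a_or_b": {"a", "b"},            # splits the hypotheses in half
    "is_a": {"a"},                      # informative, but less so
    "is_anything": {"a", "b", "c", "d"},  # always "yes": carries no information
}

best = select_action(belief, questions)
belief_after = update_belief(belief, questions[best], True)
```

In this toy setting, an agent that keeps asking a question like `is_anything` gains zero bits and its belief never changes, so the final outcome gives no basis for distinguishing good belief updates from bad ones. This is a simplified picture of the low-information regime the abstract describes.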

Introduction. Reinforcement learning (RL) with outcome-based rewards has demonstrated great success in improving the reasoning capabilities of large language models (LLMs) (Wang et al., 2024; Srivastava & Aggarwal, 2025; Xu et al., 2025; Guo et al., 2025). Recently, RL has also received increasing attention for building LLM-based agents, where the agent needs to interact with the environment and resolve tasks beyond

Discussion / Conclusion. We study information self-locking (SeL) in long-horizon active reasoning and show that it arises from a structural failure of credit assignment with bidirectional coupling between action selection (AS) and belief tracking (BT). We provide both theoretical and empirical evidence that standard outcome-based RL can be trapped in SeL. We propose AREW, a critique-driven reweighting approach that selectively reallocates optimization signal along trajectories. Experiments across multiple benchmarks demonstrate consistent gains, robustness to noisy critiques, effectiveness across multiple RL mechanisms, and improvements in training dynamics and in AS and BT capabilities. We believe this perspective opens up new directions for designing robust learning mechanisms for interactive reasoning agents.
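The text here does not give AREW's exact formulation, so the following is only a hypothetical sketch of what reallocating optimization signal along a trajectory could look like. It assumes each step carries a directional critique in {+1, 0, -1} (helpful, neutral, unhelpful), and that a single outcome-based advantage is otherwise shared by every step. The function name `reweight_advantages` and the `strength` parameter are illustrative assumptions, not the authors' API.

```python
def reweight_advantages(trajectory_advantage, critiques, strength=0.5):
    """Hypothetical per-step reweighting driven by directional critiques.

    Each step starts from the shared outcome-based advantage and is shifted
    by its critique relative to the trajectory's mean critique. The shifts
    sum to zero, so signal is moved between steps rather than added overall.
    """
    if not critiques:
        return []
    mean_critique = sum(critiques) / len(critiques)
    return [
        trajectory_advantage + strength * (c - mean_critique)
        for c in critiques
    ]

# Assumed example: a trajectory whose outcome-based advantage is zero,
# e.g. because every rollout in its group received the same reward.
step_critiques = [1, -1, 0, 0]  # +1 helpful query, -1 unhelpful query
step_advantages = reweight_advantages(0.0, step_critiques)
```

With a zero trajectory-level advantage, plain outcome-based training would give every step the same null signal. In this sketch the critiques still separate the helpful step from the unhelpful one, which illustrates how a step-level signal might help an agent leave the self-locked regime.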