Does RL training follow a predictable two-phase learning sequence?

This note explores whether reinforcement learning exhibits consistent phases in which basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

Across eight text-only and vision-language models, RL training reveals a consistent two-phase dynamic. In the first phase, the learning bottleneck is procedural correctness: a single calculation error invalidates an entire solution, creating a strong gradient signal that compels mastery of low-level execution tokens (arithmetic, variable substitution, formula application). In the second phase, the bottleneck shifts to strategic planning, that is, exploring and mastering high-level planning tokens (deduction such as "we can use the fact that," branching such as "let's try a different approach," and backtracing such as "but the problem mentions that").
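The split between planning and execution tokens can be illustrated with a minimal sketch. It assumes planning tokens can be located by matching literal cue phrases, which is a simplification for illustration and not the method the source describes. The cue list reuses only the three example phrases quoted above; the names `PLANNING_CUES`, `tag_planning_spans`, and `planning_mask` are hypothetical.

```python
import re

# Hypothetical cue phrases, copied from the examples quoted in the note.
# A real system would need a much richer way to detect planning tokens.
PLANNING_CUES = {
    "deduction": ["we can use the fact that"],
    "branching": ["let's try a different approach"],
    "backtracing": ["but the problem mentions that"],
}


def tag_planning_spans(trace: str) -> list[tuple[int, int, str]]:
    """Return sorted (start, end, kind) character spans of planning cues."""
    lowered = trace.lower()  # assumes ASCII text, so offsets are unchanged
    spans = []
    for kind, cues in PLANNING_CUES.items():
        for cue in cues:
            for match in re.finditer(re.escape(cue), lowered):
                spans.append((match.start(), match.end(), kind))
    return sorted(spans)


def planning_mask(trace: str) -> list[bool]:
    """One flag per whitespace-separated token: True if it overlaps a cue."""
    spans = tag_planning_spans(trace)
    mask = []
    for token in re.finditer(r"\S+", trace):
        overlaps = any(token.start() < end and token.end() > start
                       for start, end, _ in spans)
        mask.append(overlaps)
    return mask
```

Everything the mask leaves as False is treated as an execution token here. Whitespace tokenization stands in for a real tokenizer, so the mask would need to be re-aligned to model tokens before use in training.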

The phases are not mutually exclusive. Procedural refinement continues throughout training, but the primary driver of marginal performance gains shifts to strategic planning. This is why the "aha moment" phenomenon appears when it does: it represents the discovery and internalization of high-level reasoning strategies, which become the active learning frontier only after procedural skills are consolidated.

The entropy dynamics tell the same story. Planning tokens show increasing strategic diversification over training — the model explores new ways to combine established skills. Execution tokens show stable conditional entropy — once arithmetic is mastered, there's little incentive to find diverse ways to perform it. The performance improvement comes from discovering new combinations of established skills, which is the core function of planning.
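A rough way to check this pattern on logged rollouts is to average entropy separately over the two token groups. The sketch below assumes per-position next-token probability distributions and a planning/execution flag per position are already available. It uses plain Shannon entropy as a simple proxy, which may differ from the diversity measure behind the claim above; `shannon_entropy` and `mean_entropy_by_group` are hypothetical helper names.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_group(distributions, is_planning):
    """Average per-position entropy separately for planning and execution.

    distributions: list of next-token probability lists, one per position.
    is_planning:   list of bools, True where the position is a planning token.
    """
    groups = {"planning": [], "execution": []}
    for probs, flag in zip(distributions, is_planning):
        key = "planning" if flag else "execution"
        groups[key].append(shannon_entropy(probs))
    # NaN marks a group with no positions, rather than raising an error.
    return {key: (sum(vals) / len(vals) if vals else float("nan"))
            for key, vals in groups.items()}
```

Tracking these two averages across checkpoints would show whether the planning group diversifies while the execution group stays flat, under the stated proxy assumption.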

This insight exposes a core inefficiency in algorithms like GRPO that apply optimization pressure uniformly across all tokens. If the learning frontier is in planning tokens but gradient signal is diluted across execution tokens, optimization is wasteful. HICRA addresses this by concentrating optimization on planning tokens, achieving significant performance gains.
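The contrast between uniform and concentrated credit can be sketched as follows. The first two functions show group-normalized advantages broadcast equally to every token, in the style of GRPO. The third shows one hypothetical way to concentrate signal on planning tokens, by adding a fraction `alpha` of the advantage's magnitude at those positions. This is an assumption for illustration and is not claimed to be HICRA's exact formula; the function names, `alpha`, and the use of the population standard deviation are likewise choices made here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, chosen for simplicity
    return [(r - mean) / (std + eps) for r in rewards]


def uniform_token_advantages(advantage, num_tokens):
    """Uniform credit: every token in the rollout gets the same advantage."""
    return [advantage] * num_tokens


def planning_weighted_advantages(advantage, is_planning, alpha=0.2):
    """Hypothetical concentrated credit: shift planning-token advantages up.

    Planning tokens receive advantage + alpha * |advantage|; execution
    tokens keep the unmodified advantage.
    """
    return [advantage + alpha * abs(advantage) if flag else advantage
            for flag in is_planning]
```

Under this choice, planning tokens in successful rollouts are reinforced more strongly and planning tokens in failed rollouts are penalized less, which keeps exploration of strategies alive. Other weighting schemes are possible.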

The connection to existing insights is illuminating. HICRA's planning tokens are likely the same phenomenon that the note "Which sentences actually steer a reasoning trace?" identifies from a mechanistic perspective as thought anchors. The two-phase dynamic also bears on the note "Do reasoning cycles in hidden states reveal aha moments?": the graph structure reflects the transition from procedural execution (local structure) to strategic planning (global topology).

Related concepts in this collection 6

Which sentences actually steer a reasoning trace?
Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
converges: planning tokens in HICRA likely correspond to thought anchors

Do reasoning cycles in hidden states reveal aha moments?
What if the internal loops in model reasoning, visible in hidden-state topology, correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
extends: the two-phase dynamic explains how graph topology evolves during training

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
reframes: entropy collapse may be acceptable for execution tokens but catastrophic for planning tokens

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
deepens: the "when" is specifically about planning tokens; execution tokens are "how"

What happens inside models when they suddenly generalize?
Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
analogous phased development: grokking's memorization-then-circuit-formation parallels the procedural-then-strategic progression; both show that generalization requires passing through a consolidation phase before higher-order structure emerges

Can language modeling close the knowing-doing gap in AI?
Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
TiG operates on the same procedural-vs-strategic axis HICRA identifies, but at the architectural level: language-as-policy refined by RL preserves declarative reasoning while building procedural competence. HICRA's two-phase dynamic predicts the order TiG observes during training
