SYNTHESIS NOTE

Does RL training follow a predictable two-phase learning sequence?

This note explores whether reinforcement learning proceeds through consistent phases, in which basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

Across eight text-only and vision-language models, RL training reveals a consistent two-phase dynamic. In the first phase, the learning bottleneck is procedural correctness: a single calculation error invalidates an entire solution, creating a strong gradient signal that compels mastery of low-level execution tokens (arithmetic, variable substitution, formula application). In the second phase, the bottleneck shifts to strategic planning: exploring and mastering high-level planning tokens, such as deduction ("we can use the fact that"), branching ("let's try a different approach"), and backtracing ("but the problem mentions that").
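To make the execution/planning split concrete, the sketch below tags the tokens of a reasoning trace as planning or execution by matching the three example phrases quoted above. It is only an illustration: the phrase list, the `planning_mask` helper, and the whitespace tokenization are assumptions made here, not the procedure used in the source.

```python
# Hypothetical phrase list, seeded with the three examples quoted in the note.
PLANNING_PHRASES = [
    "we can use the fact that",
    "let's try a different approach",
    "but the problem mentions that",
]


def planning_mask(tokens, phrases=PLANNING_PHRASES):
    """Return one boolean per token: True if the token lies inside a planning phrase."""
    lowered = [t.lower() for t in tokens]
    mask = [False] * len(tokens)
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        # Slide a window over the trace and mark every exact phrase match.
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == words:
                for i in range(start, start + n):
                    mask[i] = True
    return mask


# Toy trace, split on whitespace (an assumption; real models use subword tokens).
trace = "x = 3 so 2x = 6 . but the problem mentions that x is negative".split()
mask = planning_mask(trace)
planning = [tok for tok, is_plan in zip(trace, mask) if is_plan]
execution = [tok for tok, is_plan in zip(trace, mask) if not is_plan]
```

Here `planning` holds the five tokens of the backtracing phrase and `execution` holds the remaining arithmetic and statement tokens. A real pipeline would need a far richer way of identifying planning tokens than a fixed phrase list.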

The phases are not mutually exclusive. Procedural refinement continues throughout training. But the primary driver of marginal performance gains shifts to strategic planning. This is why the "aha moment" phenomenon appears when it does — it represents the discovery and internalization of high-level reasoning strategies, which only becomes the active learning frontier after procedural skills are consolidated.

The entropy dynamics tell the same story. Planning tokens show increasing strategic diversification over training — the model explores new ways to combine established skills. Execution tokens show stable conditional entropy — once arithmetic is mastered, there's little incentive to find diverse ways to perform it. The performance improvement comes from discovering new combinations of established skills, which is the core function of planning.
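One rough way to picture "strategic diversification" is the Shannon entropy of the distribution of planning phrases a model uses across a batch of rollouts. The sketch below computes that quantity for two invented batches. The phrase labels and counts are made-up toy data, and the exact entropy measures used in the source (including the conditional entropy reported for execution tokens) may be defined differently.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution over `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data, invented for illustration: the planning phrase used in each of 8 rollouts.
early_planning = ["use the fact"] * 8
late_planning = (
    ["use the fact"] * 2
    + ["try another approach"] * 2
    + ["check the constraint"] * 2
    + ["work backwards"] * 2
)

early_h = shannon_entropy(early_planning)  # one strategy only: zero entropy
late_h = shannon_entropy(late_planning)    # four equally likely strategies: 2 bits
```

Under this toy measure, a checkpoint that always reaches for the same strategy scores 0 bits, while one that spreads evenly over four strategies scores 2 bits. Applying the same function to execution tokens would be expected to give a roughly flat curve if, as the note says, their entropy stays stable.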

This insight exposes a core inefficiency in algorithms like GRPO (Group Relative Policy Optimization) that apply optimization pressure uniformly across all tokens. If the learning frontier lies in planning tokens but the gradient signal is diluted across execution tokens, much of the optimization effort is wasted. HICRA addresses this by concentrating optimization on planning tokens, achieving significant performance gains.
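The contrast between uniform and concentrated credit can be sketched as follows. The group-normalized advantage is the standard GRPO-style quantity, broadcast to every token of a rollout. The planning-token rule shown here (adding `alpha * |A|` on planning tokens) is an assumption chosen for illustration, since this note does not state HICRA's actual formula; the function names and the `alpha` parameter are likewise hypothetical.

```python
import statistics


def group_advantages(rewards):
    """GRPO-style advantage: each rollout's reward normalized within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    if std == 0:
        return [0.0 for _ in rewards]  # no signal when all rollouts score the same
    return [(r - mean) / std for r in rewards]


def uniform_token_advantages(advantage, mask):
    """Uniform credit: every token in the rollout receives the same advantage."""
    return [advantage for _ in mask]


def planning_weighted_advantages(advantage, mask, alpha=0.2):
    """Illustrative concentrated credit: planning tokens get A + alpha * |A|."""
    return [
        advantage + alpha * abs(advantage) if is_plan else advantage
        for is_plan in mask
    ]


# Toy group of four rollouts with binary rewards (invented data).
advs = group_advantages([1.0, 0.0, 0.0, 1.0])

# Toy planning mask for a four-token rollout: the middle two tokens are planning.
mask = [False, True, True, False]
uniform = uniform_token_advantages(advs[0], mask)
weighted = planning_weighted_advantages(advs[0], mask, alpha=0.5)
```

With this rule, a successful rollout pushes harder on its planning tokens than on its execution tokens, and a failed rollout penalizes its planning tokens less. Whether that asymmetry matches the source's method is not something this note confirms.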

The connection to existing notes is illuminating. HICRA's planning tokens are likely the same phenomenon that the note "Which sentences actually steer a reasoning trace?" identifies from a mechanistic perspective. The two-phase dynamic also bears on the note "Do reasoning cycles in hidden states reveal aha moments?": the graph structure reflects the transition from procedural execution (local structure) to strategic planning (global topology).

Original note title

rl training exhibits a two-phase dynamic where procedural consolidation precedes strategic planning exploration