INQUIRING LINE

Why must procedural skills consolidate before strategic reasoning can develop?

This explores a finding about the *order* in which reasoning ability is built: that models seem to lock in reliable execution ('how to carry out a step') before they can productively learn higher-level planning ('which steps to take') — and why that sequence might be necessary rather than accidental.


The corpus suggests that models cannot develop good strategy until the basic mechanics of getting steps *right* are already dependable. The clearest evidence is a two-phase pattern observed across eight models during RL training: a first phase where the bottleneck is execution correctness, then a second phase where strategic planning becomes the thing worth optimizing. Tellingly, the entropy (uncertainty) on planning tokens *rises* in phase two while execution entropy settles down, which reads almost literally as: once the hands are steady, the model can afford to explore with its head (Does RL training follow a predictable two-phase learning sequence?).
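To make the entropy comparison concrete, here is a minimal sketch, not the method used in the underlying study. It assumes each generated token comes with its next-token probability distribution and a role label (`planning` or `execution`); how tokens get labeled is left open, and the function names and toy numbers are hypothetical.

```python
import math
from collections import defaultdict


def shannon_entropy(probs):
    """Entropy in nats of one next-token distribution (a list of probabilities)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(distributions, roles):
    """Average per-token entropy separately for each role label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probs, role in zip(distributions, roles):
        totals[role] += shannon_entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}


# Toy data (assumed, for illustration only): four tokens over a 4-word vocabulary.
distributions = [
    [0.25, 0.25, 0.25, 0.25],  # planning token, model is undecided
    [0.97, 0.01, 0.01, 0.01],  # execution token, model is nearly certain
    [0.50, 0.50, 0.00, 0.00],  # planning token, two live options
    [1.00, 0.00, 0.00, 0.00],  # execution token, fully certain
]
roles = ["planning", "execution", "planning", "execution"]

result = mean_entropy_by_role(distributions, roles)
print(result)
```

Tracked over training checkpoints, a rising `planning` average alongside a flat or falling `execution` average would be the signature the two-phase account describes.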

Why would this ordering be forced rather than optional? Look at what 'procedural knowledge' actually is. An analysis of five million pretraining documents found that reasoning generalization rides on broad, transferable procedural patterns, the reusable how-to of solving, rather than on memorized facts (Does procedural knowledge drive reasoning more than factual retrieval?). Strategy is choosing *among* procedures. If the procedures themselves are unreliable, the strategic layer has nothing trustworthy to choose between: a good plan executed by shaky mechanics still fails, so the training signal can't cleanly reward the plan. Consolidating execution first is what makes strategic credit assignment even legible.
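The credit-assignment argument can be illustrated with a toy calculation. This is a sketch under simplifying assumptions that are not from the source: a plan has `num_steps` steps, each step is executed correctly and independently with probability `step_reliability`, a good plan succeeds only if every step is executed correctly, and a bad plan never succeeds. The gap in success rate between good and bad plans is then the signal available for rewarding plan choice.

```python
def success_rate(step_reliability, num_steps, plan_is_good):
    """Probability of a correct final answer under the toy assumptions above."""
    if not plan_is_good:
        return 0.0  # assumed: a bad plan fails no matter how well it is executed
    return step_reliability ** num_steps


def plan_signal(step_reliability, num_steps):
    """Expected reward gap between choosing a good plan and a bad one."""
    good = success_rate(step_reliability, num_steps, plan_is_good=True)
    bad = success_rate(step_reliability, num_steps, plan_is_good=False)
    return good - bad


for reliability in (0.70, 0.90, 0.99):
    gap = plan_signal(reliability, num_steps=10)
    print(f"step reliability {reliability:.2f}: plan signal {gap:.3f}")
```

With ten steps, the gap is under 0.03 at 70% step reliability and above 0.9 at 99%. In this toy setting, a good plan is almost indistinguishable from a bad one until execution is dependable.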

The failure modes that show up when this foundation is shaky are revealing. Reasoning models 'wander' (explore invalid paths) and 'underthink' (abandon promising paths too early), and the fix isn't more compute but structural organization, since decoding-level nudges such as thought-switching penalties improve accuracy without fine-tuning (Why do reasoning models abandon promising solution paths?). That's a strategic-layer problem (knowing which path to commit to) sitting on top of capability that already exists. Relatedly, work on abstractions shows that strategy is really about allocating exploration well, breadth-first across diverse approaches rather than drilling one chain, and that this only pays off at larger compute budgets, i.e. *after* the basics are cheap enough to spend on planning (Can abstractions guide exploration better than depth alone?).
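One decoding-level nudge of the kind described above is a thought-switching penalty, which the source note names without giving details. The sketch below is an assumed, simplified form and not the published method: it lowers the score of tokens that open a new line of thought until the current thought has run for a minimum number of tokens. The token strings, the `penalty` and `min_thought_length` parameters, and both function names are hypothetical.

```python
def penalize_thought_switch(logits, switch_tokens, tokens_in_current_thought,
                            penalty=3.0, min_thought_length=20):
    """Return a copy of `logits` with switch tokens down-weighted early in a thought.

    logits: dict mapping candidate token -> raw score.
    switch_tokens: set of tokens assumed to signal a change of approach.
    """
    if tokens_in_current_thought >= min_thought_length:
        return dict(logits)  # the thought is long enough, so switching is allowed
    return {
        token: score - penalty if token in switch_tokens else score
        for token, score in logits.items()
    }


def greedy_pick(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy candidate scores (assumed, for illustration only).
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch_tokens = {"Alternatively"}

early = greedy_pick(penalize_thought_switch(logits, switch_tokens, 5))
late = greedy_pick(penalize_thought_switch(logits, switch_tokens, 25))
print(early, late)
```

Five tokens into a thought, the penalty keeps the model on its current path (`so`); after 25 tokens the switch is allowed again (`Alternatively`). No weights change, which is why this kind of intervention counts as decoding-level rather than training.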

There's a deeper reframing lurking here that the question doesn't ask but the corpus offers: maybe RL post-training doesn't *create* reasoning at all. Instead, it teaches the model *when* to deploy reasoning it already latently has, since base models contain the strategies before any RL and hybrid models recover 91% of the performance gains just by routing tokens (Does RL post-training create reasoning or just deploy it?). Under that view, 'procedural consolidation must come first' becomes: the raw operations pre-exist, and the strategic phase is about reliable *timing and selection*. That is exactly why the same thinking mechanism flips from counterproductive self-doubt to productive gap analysis once training stabilizes its use (Does extended thinking help or hurt model reasoning?).

Two cautions are worth carrying away. First, this sequence isn't free: training hard for step-by-step procedure can *narrow* a model, making it overthink ill-posed questions and reason its way to wrong rules, with strategic competence over-fitted to one shape of problem (What critical thinking skills do reasoning models actually lose?). Second, 'strategic reasoning' isn't one thing: across 22 models it splits into distinct styles (minimax, trust-based, belief-anticipation) tied to game structure rather than raw depth (Do large language models use one reasoning style or many?). So the honest version of the claim is: procedural reliability is the *substrate* strategy needs to stand on, but what gets built on that substrate is plural, and building it can quietly cost flexibility.


Sources (8 notes)

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

What critical thinking skills do reasoning models actually lose?

Models trained for step-by-step reasoning excel at in-distribution logical tasks but lose critical abilities: they overthink ill-posed questions instead of disengaging, and reason their way to wrong rules on inductive tasks. This cognitive narrowing is partly reversible through targeted RL training.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
