What structural differences emerge between early generic skills and later meta-strategy skills?
This explores how the 'early skills' a model masters first (the nuts and bolts of executing a procedure correctly) differ in shape and behavior from the 'later skills' it develops (choosing a strategy, planning, deciding when to do what) — and what the corpus says about why these are structurally distinct rather than just harder versions of the same thing.
The corpus points to a fairly consistent picture: early, generic execution skills and later meta-strategy skills aren't a smooth continuum, they're two regimes with different mechanics. The clearest evidence comes from a study of RL training across eight models, which found a two-phase dynamic: a first phase where getting the execution right is the bottleneck, followed by a second phase where strategic planning becomes the thing that actually limits performance (Does RL training follow a predictable two-phase learning sequence?). The structural tell is in the entropy: planning-token entropy rises as the model keeps exploring, while execution-token entropy settles down and stabilizes. So 'early generic skill' looks like convergence (one right way to execute a step), and 'later meta-strategy skill' looks like sustained branching (many possible plans, with the model keeping its options open).
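To make the entropy contrast concrete, here is a minimal sketch of how per-token entropy could be averaged separately for planning and execution tokens. The role labels, the four-token vocabulary, and the probability values are all hypothetical toy inputs chosen for illustration; they are not data from the cited study, and how that study actually labels tokens is not specified here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(steps):
    """Average entropy per role; `steps` is a list of (role, probs) pairs."""
    totals, counts = {}, {}
    for role, probs in steps:
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


# Hypothetical toy distributions over a 4-token vocabulary (assumed values).
steps = [
    ("planning", [0.25, 0.25, 0.25, 0.25]),   # many plausible next moves
    ("planning", [0.40, 0.30, 0.20, 0.10]),
    ("execution", [0.97, 0.01, 0.01, 0.01]),  # one clearly preferred continuation
    ("execution", [1.00, 0.00, 0.00, 0.00]),
]

result = mean_entropy_by_role(steps)
print(result)
```

In this toy setup the planning average comes out higher than the execution average, which is the shape of the pattern described above: branching at decision points, convergence at procedural steps.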
That split shows up again when you decompose skills and watch how they scale. A 12-skill breakdown found that metacognition-style skills saturate early, around 7B parameters, and logical efficiency plateaus around 30B, while reasoning and knowledge skills keep improving with scale (Do all AI skills improve equally as models scale?). In other words, different skill families have different growth curves, which is exactly what you'd expect if they're structurally distinct rather than one capability scaled up. The same note makes a sharper point: open-source models can imitate surface *style* convincingly but fail at *reasoning*, because distillation copies the form, not the substance. A separate finding pushes this further: chains of thought built from logically *invalid* steps matched the performance of valid ones on BIG-Bench Hard, because the model is learning the *shape* of reasoning, not genuine inference (Does logical validity actually drive chain-of-thought gains?). Early skills are often this kind of learned form; meta-strategy is where the structural content has to actually be there.
Here's the part you might not expect: several notes argue the meta-strategy layer isn't really *built* during training at all; it's *selected*. Base models already carry latent reasoning strategies in their activations, and post-training mostly elicits what's there rather than creating it (Do base models already contain hidden reasoning ability?). Pushed further, one analysis frames RL post-training as teaching a model *when* to reason, not *how*: the strategies pre-exist as activation vectors, and training optimizes deployment timing (Does RL post-training create reasoning or just deploy it?). If that's right, the structural difference between early and late skills is partly a difference between *acquiring* a procedure and *learning to route to* a strategy you already had.
The meta-strategy layer also has a distinctive geometry. One method shows reasoning works best when exploration goes breadth-first through diverse abstractions rather than drilling depth-first down a single chain: depth-only reasoning hits an 'underthinking' failure mode that structured breadth avoids (Can abstractions guide exploration better than depth alone?). That maps cleanly onto the entropy story: meta-strategy *is* the breadth, the deliberate keeping-open of multiple plans. And the deep substrate underneath both phases seems to be procedural knowledge: reasoning generalizes from broad, transferable procedural patterns picked up across many documents, unlike factual recall, which leans on narrow memorization (Does procedural knowledge drive reasoning more than factual retrieval?).
One last wrinkle worth knowing: training *order* mechanically reshapes which skills survive. Structured tasks pull output entropy down while creative tasks push it up, and the reported remedy is a schedule: training structured tasks first beat joint training by 6.2%, because it keeps entropy collapse from damaging open-ended capability (Does training order reshape how models handle different task types?). So the early-generic-vs-late-meta distinction isn't only about what the model learns; it's also about sequence. Let the convergent execution skills drive entropy down unchecked and you can crush the high-entropy exploration that meta-strategy depends on. The two regimes don't just differ; they can be in tension.
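As a rough picture of what "entropy collapse" means for an output distribution, the sketch below sharpens and flattens one set of logits and compares the resulting entropies. Temperature is used here purely as a stand-in for whatever training pressure sharpens the distribution; that substitution, the `softmax` helper, and the logit values are assumptions for illustration, not how the cited work induces or measures the effect.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits over four candidate continuations (assumed values).
logits = [2.0, 1.0, 0.5, 0.0]

sharpened = entropy(softmax(logits, temperature=0.25))  # convergent, low entropy
baseline = entropy(softmax(logits, temperature=1.0))
flattened = entropy(softmax(logits, temperature=4.0))   # exploratory, high entropy

print(sharpened, baseline, flattened)
```

The sharpened distribution puts most of its mass on one continuation, which is fine for a structured task with a single right answer but leaves little room for the branching that open-ended or strategic generation relies on.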
Sources (8 notes)
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.