INQUIRING LINE

Does RL training activate latent meta-learning capacity or create it from scratch?

This explores a sharp version of the activation-vs-creation debate: when RL gives a model the ability to learn-on-the-fly (meta-learning), is it switching on something the base model already had, or building a genuinely new ability — and the corpus suggests the honest answer is 'it depends what you're asking RL to do.'


The corpus refuses to give a single clean answer; instead it draws a line based on task difficulty. The dominant view across several notes is activation, not creation. Base models already carry reasoning ability in latent form, and five independent techniques (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning that was already present in the activations [Do base models already contain hidden reasoning ability?]. One striking framing is that RL teaches a model *when* to reason, not *how*: hybrid models recover 91% of the gains just by routing tokens, and the activation vectors for reasoning strategies exist before any RL is applied [Does RL post-training create reasoning or just deploy it?]. On this view, reward learning sharpens sampling efficiency inside an existing capability boundary rather than widening it: a single training example can trigger activation, and even spurious rewards work nearly as well as correct ones for models with appropriate pretraining [What does reward learning actually do to model reasoning?].

But the same corpus contains its own counter-evidence, and that's where it gets interesting. The cleanest reconciliation says capability creation is *domain-conditional*: for standard reasoning, RL activates what's latent; for complex, multi-step planning, RL generates genuinely novel strategies the base model can't reach even with extensive sampling [Does reinforcement learning create new reasoning abilities or activate existing ones?]. Prolonged RL on diverse, non-mathematical tasks, with KL control and policy resetting, pushes models past base-model performance at *every* pass@k level, which is the signature of an expanded boundary, not just better sampling [Can reinforcement learning discover reasoning strategies base models cannot?]. So whether you see 'activation' or 'creation' depends heavily on how hard and how unfamiliar the task is.
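
The pass@k comparison behind that claim can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts, the helper `pass_at_k_curve`, and the choice of k values are made up for illustration and are not taken from the notes.

```python
from math import comb
from statistics import mean
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n generated samples (c correct), passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(correct_counts: List[int], n: int, ks: List[int]) -> Dict[int, float]:
    """Mean pass@k over problems, for each k in ks (hypothetical helper)."""
    return {k: mean(pass_at_k(n, c, k) for c in correct_counts) for k in ks}


# Hypothetical per-problem correct counts out of n = 100 samples (made-up data).
N = 100
KS = [1, 10, 100]
base_counts = [40, 10, 0]  # the base model never solves the third problem
rl_counts = [90, 60, 5]    # the RL model solves it occasionally

base_curve = pass_at_k_curve(base_counts, N, KS)
rl_curve = pass_at_k_curve(rl_counts, N, KS)

# "Expanded boundary" pattern: RL is at least as good at every k, large k included.
expanded = all(rl_curve[k] >= base_curve[k] for k in KS)
print(base_curve, rl_curve, expanded)
```

In the pure sharpening pattern, by contrast, the RL model would lead at small k while the base model catches up or overtakes it at large k, because the base model can still reach every solved problem given enough samples.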

Now the part you didn't know you wanted: meta-learning specifically may be the cleanest case of genuine creation. An RL-fine-tuned transformer (Llama 3.1 8B in the note) develops *in-context reinforcement learning*: it solves unseen problems through within-episode adaptation at human-level sample efficiency, with no weight updates during the solving. This emerges from RL's training pressure combined with the transformer's context window [Can transformers learn to solve new problems within episodes?]. That's not eliciting a stored answer; it's installing a *procedure* for learning from experience on the fly. So the meta-learning version of your question may land differently than the reasoning version: RL looks more like a creator of learn-to-learn behavior than of object-level reasoning.

What changes inside the model during RL points in the same direction. RL rewrites only 5–30% of parameters, but those updates are nearly full-rank and nearly identical across random seeds, which looks like structural surgery rather than arbitrary noise [Does reinforcement learning update only a small fraction of parameters?]. And the learning unfolds in two phases: first procedural mastery (getting execution right), then a shift where strategic planning becomes the bottleneck and planning-token entropy climbs while execution entropy stabilizes [Does RL training follow a predictable two-phase learning sequence?]. That second, exploratory phase is exactly where you'd expect new strategy, including meta-strategy, to be forged rather than merely surfaced.
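
To see how an update can be sparse and still full-rank, here is a toy sketch in plain Python. It measures the fraction of changed entries between a "base" and a "tuned" weight matrix and the rank of their difference. The 4x4 matrices, the tolerances, and the helper names `update_fraction` and `matrix_rank` are illustrative assumptions; the notes do not describe how the sparsity and rank figures were computed.

```python
from typing import List

Matrix = List[List[float]]


def update_fraction(base: Matrix, tuned: Matrix, tol: float = 1e-8) -> float:
    """Fraction of entries that changed by more than tol."""
    total = 0
    changed = 0
    for row_b, row_t in zip(base, tuned):
        for b, t in zip(row_b, row_t):
            total += 1
            if abs(t - b) > tol:
                changed += 1
    return changed / total


def matrix_rank(m: Matrix, tol: float = 1e-9) -> int:
    """Rank by Gaussian elimination with partial pivoting (toy sizes only)."""
    a = [row[:] for row in m]
    rows, cols = len(a), len(a[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


# Toy weights (made up): only the diagonal is updated, i.e. 4 of 16 entries.
base = [[0.0] * 4 for _ in range(4)]
tuned = [row[:] for row in base]
for i in range(4):
    tuned[i][i] = 0.5

delta = [[t - b for b, t in zip(rb, rt)] for rb, rt in zip(base, tuned)]
print(update_fraction(base, tuned), matrix_rank(delta))
```

The toy update touches a quarter of the entries yet its difference matrix has full rank, which is the distinction the note draws between sparse updates and low-rank ones.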

Two cautions are worth carrying. RL is also a narrowing force: it tends to collapse onto a single dominant pretraining format within the first epoch, suppressing alternatives, and the winning format tracks model scale rather than performance [Does RL training collapse format diversity in pretrained models?]. And pushing 'creation' too hard backfires: training on near-impossible problems teaches degenerate shortcuts that contaminate abilities the model already had [Do overly hard RLVR samples actually harm model capabilities?]. The takeaway: RL mostly activates what's latent, but at the frontier of difficulty, and especially for learn-to-learn behavior, it can build something new, provided the task is hard enough to demand it but not so hard it rewards cheating.
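
The second caution rests on how group-relative normalization weights rare successes, as the source note on overly hard samples describes. A minimal sketch, assuming a GRPO-style advantage of (reward - group mean) / group standard deviation with binary rewards: the exact formula, the use of the population standard deviation, the `eps` term, and the group size of 16 are assumptions for illustration, not details from the notes.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 accidental success in a group of 16 samples.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes in a group of 16 samples.
balanced_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

# Advantage assigned to a successful trajectory in each group.
print(round(hard_adv[0], 3), round(balanced_adv[0], 3))
```

Under these assumptions a lone success in a group of 16 gets an advantage near the square root of 15, almost four times the weight of a success on the balanced problem, so whatever shortcut produced that lucky trajectory is reinforced strongly.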


Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Can transformers learn to solve new problems within episodes?

Llama 3.1 8B fine-tuned with RL exhibits emergent in-context reinforcement learning, solving unseen problems through within-episode adaptation at human-level sample efficiency. This meta-learning emerges from RL's training pressure combined with the transformer's context window, without weight updates.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Next inquiring lines