INQUIRING LINE

Can RL create new reasoning primitives that pretraining never established?

This explores whether reinforcement learning can build reasoning abilities that pretraining never put there in the first place — or whether RL only sharpens and reschedules abilities the base model already had.


The corpus is unusually split on this question, which is the interesting part. The dominant finding is deflationary: RL mostly *elicits* what's already latent. Pass@k analysis shows base models matching or beating RLVR-trained models at high sampling counts, which means RL narrowed the model toward solutions already in its distribution rather than adding new ones (see "Does RLVR actually expand what models can reason about?"). Several independent results converge here: a single training example can trigger the gains, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones, a signature of activation rather than teaching (see "What does reward learning actually do to model reasoning?" and "How does RL training reshape reasoning and what gets lost?"). One framing puts it sharply: RL teaches a model *when* to reason, not *how*, and hybrid models recover 91% of the gains just by routing tokens (see "Does RL post-training create reasoning or just deploy it?"). If reasoning strategies show up as steerable activation directions before any RL, the bottleneck was never capability; it was elicitation (see "Do base models already contain hidden reasoning ability?").
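The pass@k comparison behind this argument can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator (draw n samples per problem, count c correct, estimate the chance that at least one of k draws is correct). The per-problem counts below are hypothetical numbers invented for illustration, not results from any of the cited notes; they only show how a model can win at k=1 and still lose at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a hit.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL data: correct samples out of n=64 on four problems.
N_SAMPLES = 64
base_counts = [1, 2, 1, 3]    # solves every problem, but rarely
rl_counts = [60, 58, 0, 0]    # solves two problems reliably, never the others

for k in (1, 8, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-style counts dominate at k=1 while the base-style counts dominate at k=64, which is the crossover pattern the narrowing interpretation relies on. A model that beat the base at every k would show no such crossover.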

But the corpus refuses to settle there. Prolonged RL on diverse, non-mathematical tasks, with KL control and policy resetting, produces models that beat the base across *all* pass@k levels, which is the one result the 'just sampling' story can't explain (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The reconciling insight is conditional: capability creation depends on the task. For standard reasoning, RL activates latent ability; for complex multi-step planning, it generates genuinely novel strategies the base model can't reach even with heavy sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). A controlled synthetic study makes the precondition explicit: RL extends reasoning only when pretraining already laid down the right primitives *and* the RL data targets the edge of competence; absent either, you get sampling refinement, not new capability (see "When does RL actually extend reasoning beyond pretraining?").

So the answer to your literal question is closer to 'rarely, and only under specific conditions' than a clean yes. RL doesn't seem to manufacture primitives from nothing — it works with the raw material pretraining left, and creates something new mainly where there's headroom and the task demands compositional planning the base never practiced.

The sharper move the corpus suggests is to stop asking RL to do this job alone. If new reasoning primitives have to come from somewhere, plant them earlier: treating chain-of-thought as an exploratory action *during* pretraining, rewarded by information gain, lifts reasoning ~19% (see "Can chain-of-thought reasoning be learned during pretraining itself?"), and looped architectures bake iterative reasoning into latent space at pretraining time (see "Can reasoning happen in latent space during pretraining?"). Meanwhile, you can extract latent primitives with no RL at all: modular cognitive tools alone pushed GPT-4.1 on AIME 2024 from 26.7% to 43.3% (see "Can modular cognitive tools unlock reasoning without training?"), and verifier-free methods extend reasoning to general domains without a rule-based or model-based verifier (see "Can reasoning improvement work without answer verification?"). The thing you didn't know you wanted to know: the question 'can RL create new primitives?' may be pitched at the wrong altitude. The field is quietly shifting the burden of *creating* primitives back into pretraining, and leaving RL the narrower, real job of deciding when to deploy them.
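As a rough illustration of what "rewarded by information gain" can mean, the sketch below scores a sampled thought by how much it raises the log-likelihood of the token that actually comes next. This is a minimal sketch under stated assumptions, not the cited method's implementation: the function name, the choice of a no-thought baseline prediction, and the probabilities in the usage lines are all hypothetical.

```python
import math


def information_gain_reward(logp_with_thought: float,
                            logp_without_thought: float) -> float:
    """Log-likelihood improvement on the observed next token.

    ASSUMPTION: the baseline is a prediction of the same token made
    without the sampled thought. Positive means the thought helped.
    """
    return logp_with_thought - logp_without_thought


# Hypothetical probabilities assigned to the true next token.
helpful = information_gain_reward(math.log(0.6), math.log(0.2))
harmful = information_gain_reward(math.log(0.1), math.log(0.2))
print(f"helpful thought: {helpful:+.3f} nats")
print(f"harmful thought: {harmful:+.3f} nats")
```

The reward needs no answer checker: it depends only on ordinary next-token likelihoods, which is why a signal of this shape can be applied to pretraining text.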


Sources (12 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

When does RL actually extend reasoning beyond pretraining?

A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can reasoning happen in latent space during pretraining?

Ouro models achieve 2–3× efficiency gains by performing iterative reasoning in latent space during pretraining, not through extra capacity. Their intermediate predictions align faithfully with final outputs, making latent traces more honest than explicit chain-of-thought reasoning.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Next inquiring lines