Does RL refine existing knowledge or discover entirely new capabilities?
This explores a genuine fault line in the corpus: whether reinforcement learning only sharpens reasoning the base model already had, or whether it can build capability that wasn't there before — and the answer turns out to depend on what you're training on.
The collection is sharply split on whether RL is a *refiner* (surfacing latent ability) or a *discoverer* (creating genuinely new capability), and the split is the interesting part. One camp says refinement, full stop. Verifiable-reward training (RLVR) appears to narrow a model's sampling toward solutions already in the base distribution rather than expanding the set of solvable problems; pass@k analysis shows base models *catching up to or beating* RLVR models at high k, the telltale sign that nothing new was added (Does RLVR actually expand what models can reason about?). The same picture recurs in the claim that verifiable rewards act as catalysts surfacing pretrained strategies, not teachers building new ones. Strikingly, a single training example, or even spurious rewards, can suffice to 'activate' the behavior in models with appropriate pretraining (How does RL training reshape reasoning and what gets lost?; What does reward learning actually do to model reasoning?).
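To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k). The per-problem counts are hypothetical numbers chosen for illustration, not results from any of the notes: a "base" model that is rarely right but covers every problem, and an "RLVR" model that is reliable on half the problems and never solves the rest.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset of
    # the n samples contains no correct one. math.comb returns 0 when fewer
    # than k incorrect samples exist, so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100 on four problems.
N_SAMPLES = 100
base_counts = [2, 2, 2, 2]    # rarely right, but covers every problem
rlvr_counts = [60, 60, 0, 0]  # reliable on two problems, never solves the others

for k in (1, 8, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rlvr = mean_pass_at_k(rlvr_counts, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the narrowed model wins at k=1 and the broad one overtakes it at large k, which is the crossover pattern the refinement camp points to. The example only illustrates the metric; it says nothing about how often real models behave this way.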
The sharpest version of the refinement story reframes the whole thing: RL teaches *when* to reason, not *how*. Base models already carry reasoning strategies in latent form (activation vectors for them exist before any RL), and hybrid models recover 91% of the performance gains through token routing alone. On this view RL post-training is a deployment optimizer, not a capability creator (Does RL post-training create reasoning or just deploy it?). There's even a counterintuitive twist: in domains like medicine, RL can improve reasoning by *removing* knowledge, pruning low-reward trajectories that invoke wrong facts, so 'better' sometimes means 'subtracting,' not adding (Does RL improve domain reasoning by adding knowledge or removing it?).
But the opposing camp has receipts too. Prolonged RL on *diverse, non-mathematical* tasks, with KL control and policy resetting, produces models that beat the base across *all* pass@k levels. That is exactly the signature the refinement camp reports as missing: in their results the base model catches up at high k, and here it never does. The effect is strongest in domains where the base model has no established patterns to fall back on (Can reinforcement learning discover reasoning strategies base models cannot?). That points to the reconciliation the corpus quietly offers: capability creation is *domain-conditional*. For standard reasoning, RL activates what's already there; for complex multi-step planning, it generates novel strategies the base model can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?).
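The notes do not spell out how "KL control" is implemented, so the following is only a sketch of one common form: subtracting a KL penalty from the task reward so the policy cannot drift arbitrarily far from a reference model. The function name, the `beta` default, and the per-token log-probability inputs are all assumptions made for illustration.

```python
from typing import Sequence


def kl_shaped_reward(
    task_reward: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.05,  # hypothetical penalty weight, not taken from the notes
) -> float:
    """Task reward minus a KL penalty toward a reference policy.

    The inputs are log-probabilities of the sampled tokens under the current
    policy and under the reference model. Their summed difference is a
    single-sample estimate of KL(policy || reference) for this response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate


# A response the policy now finds more likely than the reference did
# earns slightly less than its raw task reward.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(shaped)
```

Under this reading, policy resetting would plausibly mean re-anchoring the reference to a recent policy checkpoint so the penalty does not stall long training runs, but that interpretation is an assumption rather than something the notes confirm.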
What decides which mode you get? Two levers stand out. First, reward verifiability: binary checkable rewards unlock dramatic gains (one task jumps from 0.15% to 73.98%), while fuzzy judgment-based signals barely move the needle. Clear signals 'unlock suppressed capabilities,' which still sounds like activation (Why does RL succeed more on some tasks than others?). Second, task horizon and structure: RL scales to long-horizon, multi-turn software engineering, roughly doubling SWE-bench Verified performance from 20% to 39%, which shows it works in genuinely stateful environments and not just toy single-turn settings (Can reinforcement learning scale beyond single-turn language tasks?). And there's a developmental shape to it: training moves through two phases, first nailing execution correctness, then shifting the bottleneck to strategic planning, where the novel-strategy gains actually live (Does RL training follow a predictable two-phase learning sequence?).
The thing you might not have known to ask: this 'refine vs. discover' question is entangled with a *cost*. The same mechanism that sharpens a model (entropy collapse, with policies converging on a narrow reward-maximizing path) also crushes exploration diversity, in search agents just as in reasoning. SFT on diverse demonstrations preserves breadth where RL squeezes it (Does reinforcement learning squeeze exploration diversity in search agents?). So even the 'discovery' wins come with a homogenizing pull, and the order you train domains in, with structured tasks first, can protect open-ended creativity from being flattened (Does training order reshape how models handle different task types?). Refinement and discovery aren't a clean binary. They're two ends of a dial set by your rewards, your domain, and how long you push.
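Entropy collapse can be pictured with the Shannon entropy of a next-token or next-action distribution. The sketch below uses two hypothetical distributions, invented for illustration: one spread evenly over four options, and one concentrated on a single option, as a policy converging on one reward-maximizing path would be.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions over four candidate actions.
exploratory = [0.25, 0.25, 0.25, 0.25]  # spreads probability across options
collapsed = [0.97, 0.01, 0.01, 0.01]    # almost always picks the same option

print(f"exploratory: {shannon_entropy(exploratory):.3f} nats")
print(f"collapsed:   {shannon_entropy(collapsed):.3f} nats")
```

Tracking a quantity like this over training is one simple way to watch for the diversity loss described above. How the cited work actually measures entropy is not specified in the notes.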
Sources (12 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning—not by expanding what models know. Evidence shows RL achieves +12.4 point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.
Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.