How do extrapolative and contextual generalization measure RL reasoning gains?

This explores the measurement problem behind RL reasoning claims — how we tell whether RL actually extends a model's reasoning (extrapolation past what it was trained on) versus just sharpening reasoning it already had access to in familiar contexts.

At stake is the gap between *looking* better on benchmarks and *being* able to reason about genuinely new problems. The corpus is organized around exactly this fault line, and its single most load-bearing measurement tool is the pass@k curve: the probability that at least one of k sampled attempts at a problem is correct, plotted against k. That curve is what separates 'RL made the model find existing answers faster' from 'RL gave the model answers it couldn't reach before.'
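As a concrete reference point, pass@k is commonly estimated by drawing n ≥ k samples per problem, counting the c correct ones, and computing 1 − C(n−c, k)/C(n, k). The sketch below assumes that estimator; the notes summarized here do not say which estimator each study used, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    # 1 - P(all k samples drawn without replacement are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(correct_counts, n, ks):
    """Average pass@k over problems, for each k in ks."""
    return {
        k: sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
        for k in ks
    }


# Hypothetical data: correct samples out of n=8 for three problems.
curve = pass_at_k_curve([0, 1, 8], n=8, ks=[1, 2, 8])
```

A problem with zero correct samples contributes 0 at every k, so the curve's ceiling is set by how many problems the model can solve at all, not by how reliably it solves them.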

The deflationary reading comes through clearest in the pass@k analyses. One line of work shows base models actually *overtake* RLVR-trained models at high k: with enough attempts, the base model (pretrained but not RL-tuned) solves problems the RL model can't, which means RL narrowed sampling toward solutions already living in the base distribution rather than expanding the set of solvable problems (see "Does RLVR actually expand what models can reason about?"). The companion finding is that reward learning 'activates' pretraining strategies rather than teaching new ones: for models with appropriate pretraining, a single training example, or even a spurious reward, is enough to trigger the gains (see "What does reward learning actually do to model reasoning?"). Read this way, the gains are *contextual*: RL teaches the model *when* to deploy reasoning it already had, not *how* to reason (see "Does RL post-training create reasoning or just deploy it?"). Several independent mechanisms, including steering vectors, SAE features, and decoding tweaks, all elicit the same latent capability, suggesting the bottleneck is elicitation, not acquisition (see "Do base models already contain hidden reasoning ability?").
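A toy calculation shows why narrowing produces this crossover. Assuming attempts are independent, a problem with per-attempt success probability p has pass@k = 1 − (1 − p)^k. The probabilities below are hypothetical and chosen only to illustrate the shape: a base model that is unreliable everywhere, and an RL model that is sharp on three problems but has lost the fourth entirely.

```python
def expected_pass_at_k(success_probs, k):
    """Mean over problems of 1 - (1 - p)^k, assuming independent attempts."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


def first_crossover(base_probs, rl_probs, max_k):
    """Smallest k at which the base curve exceeds the RL curve, else None."""
    for k in range(1, max_k + 1):
        if expected_pass_at_k(base_probs, k) > expected_pass_at_k(rl_probs, k):
            return k
    return None


# Hypothetical per-attempt success probabilities on four problems.
base_probs = [0.05, 0.05, 0.05, 0.05]  # broad coverage, low reliability
rl_probs = [0.60, 0.60, 0.60, 0.00]    # sharp on three, zero on the fourth

crossover_k = first_crossover(base_probs, rl_probs, max_k=256)
```

In this setup the RL model wins easily at k = 1, but its curve saturates at 0.75 because one problem has zero probability mass, while the base curve keeps climbing toward 1. No amount of extra sampling recovers a solution the policy no longer emits.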

But the extrapolation tests cut the other way once you change the experimental conditions. Prolonged RL on *diverse, non-mathematical* tasks, with KL control and policy resetting, produces models that beat the base model across *all* pass@k levels, which is the signature of genuine boundary expansion rather than sampling refinement, especially in domains where the base model never had established patterns to fall back on (see "Can reinforcement learning discover reasoning strategies base models cannot?"). A controlled synthetic study reconciles the contradiction: RL produces real capability gains only under two conditions, namely that pretraining left *headroom* and that the RL data *targets the edge of the model's competence*. Absent those, RL just refines sampling (see "When does RL actually extend reasoning beyond pretraining?"). So the disagreement in the literature is partly a disagreement about whether each experiment was set up in a regime where extrapolation was even possible.
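The two signatures described above can be told apart mechanically once both curves are measured at the same k values. The following is a minimal sketch of such a check; the label names, the tolerance parameter, and the example numbers are all hypothetical and not taken from any of the cited studies.

```python
def compare_curves(base, rl, tol=0.0):
    """Classify two pass@k curves given as {k: pass_at_k} over the same ks."""
    if base.keys() != rl.keys():
        raise ValueError("curves must be evaluated at the same k values")
    ks = sorted(base)
    diffs = [rl[k] - base[k] for k in ks]
    rl_ahead = [d > tol for d in diffs]
    base_ahead = [d < -tol for d in diffs]
    if not any(rl_ahead) and not any(base_ahead):
        return "indistinguishable"
    if not any(base_ahead):
        return "rl_dominates"    # consistent with boundary expansion
    if not any(rl_ahead):
        return "base_dominates"
    if rl_ahead[0] and base_ahead[-1]:
        return "narrowing"       # RL leads at low k, base overtakes at high k
    return "mixed"


# Hypothetical curves for illustration only.
base_curve = {1: 0.10, 16: 0.45, 256: 0.80}
narrowed_rl = {1: 0.40, 16: 0.60, 256: 0.70}
expanded_rl = {1: 0.40, 16: 0.70, 256: 0.90}

labels = (compare_curves(base_curve, narrowed_rl),
          compare_curves(base_curve, expanded_rl))
```

A nonzero `tol` is worth setting in practice, since a curve estimated from finite samples will show small sign flips that do not reflect a real crossover.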

The cleanest extrapolation measurement, though, isn't pass@k at all. It is distributional stress-testing. The DataAlchemy experiments hold the model fixed and systematically shift the task, the length, and the format away from training, then watch reasoning degrade *predictably* with distance. Chain-of-thought turns out to be distribution-bounded: outside its training neighborhood it produces fluent, confident, logically *invalid* reasoning, imitating the form of reasoning without the underlying logic (see "Does chain-of-thought reasoning actually generalize beyond training data?"). This is the contextual-vs-extrapolative distinction made into a dial you can turn. Contextual gains survive small shifts; only true capability survives large ones.
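The shape of such a stress test can be sketched as a harness that holds a model fixed and scores it on inputs bucketed by distance from the training distribution. Everything concrete below is an assumption for illustration: the letter-shift task, the length-based distance, and the toy model (which applies the rule only up to the length it was "trained" on) are hypothetical stand-ins, not DataAlchemy's actual setup.

```python
import string

ALPHABET = string.ascii_lowercase


def rot(text, k):
    """Reference task: shift every lowercase letter by k positions."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + k) % 26] for ch in text)


def make_toy_model(trained_len, k):
    """Hypothetical model: applies the rule to the first trained_len
    characters and copies the remainder through unchanged."""
    def model(text):
        return rot(text[:trained_len], k) + text[trained_len:]
    return model


def char_accuracy(pred, gold):
    """Fraction of positions where prediction and reference agree."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def stress_test(model, reference, inputs_by_distance):
    """Mean per-character accuracy at each distance from training."""
    curve = {}
    for distance, inputs in sorted(inputs_by_distance.items()):
        scores = [char_accuracy(model(x), reference(x)) for x in inputs]
        curve[distance] = sum(scores) / len(scores)
    return curve


TRAINED_LEN, SHIFT = 4, 3
model = make_toy_model(TRAINED_LEN, SHIFT)
# Distance here is the number of extra characters beyond the trained length.
inputs = {0: ["abcd", "wxyz"], 2: ["abcdef"], 4: ["abcdefgh"]}
curve = stress_test(model, lambda x: rot(x, SHIFT), inputs)
```

The toy model always returns a well-formed string of the right length, so its failures look fluent; only scoring against the reference at increasing distance exposes them.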

What you didn't know you wanted to know: the measurement debate is quietly reshaping what RL gets *built* to do. If the gain is mostly elicitation, you optimize the signal: reusing a single variance statistic as both reward and query filter for faster, more stable training (see "Can one statistical measure serve dual purposes in RL training?"), or dropping verifiers entirely by scoring reference-answer likelihood (see "Can reasoning improvement work without answer verification?"). And the structural evidence fits the elicitation story uncomfortably well: RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds, which looks like selecting a pre-existing structure rather than rewiring the model (see "Does reinforcement learning update only a small fraction of parameters?"). The frontier that genuinely tests extrapolation may instead be allocating test-time compute to diverse *abstractions* rather than more samples, forcing breadth-first exploration that depth-only chains never reach (see "Can abstractions guide exploration better than depth alone?").
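The parameter-sparsity claim is itself a measurement that can be sketched in a few lines: compare a base checkpoint with an RL-tuned one, count the entries that moved, and compare which entries moved across two seeds. The checkpoint representation (a dict of flattened weight lists), the Jaccard overlap metric, and the numbers below are assumptions for illustration; the cited work may define overlap differently.

```python
def changed_mask(base, tuned, tol=0.0):
    """Set of (tensor_name, index) positions whose value moved by more than tol."""
    mask = set()
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(tuned_vals) != len(base_vals):
            raise ValueError(f"shape mismatch in {name}")
        for i, (b, t) in enumerate(zip(base_vals, tuned_vals)):
            if abs(t - b) > tol:
                mask.add((name, i))
    return mask


def update_fraction(base, tuned, tol=0.0):
    """Fraction of all parameters that the tuned checkpoint changed."""
    total = sum(len(v) for v in base.values())
    return len(changed_mask(base, tuned, tol)) / total


def seed_overlap(base, tuned_a, tuned_b, tol=0.0):
    """Jaccard overlap of the updated positions from two training runs."""
    a = changed_mask(base, tuned_a, tol)
    b = changed_mask(base, tuned_b, tol)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical flattened checkpoints with 10 parameters in total.
base = {"w1": [0.1, 0.2, 0.3, 0.4], "w2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
seed_a = {"w1": [0.5, 0.2, 0.3, 0.4], "w2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.5]}
seed_b = {"w1": [0.6, 0.2, 0.3, 0.4], "w2": [1.0, 2.5, 3.0, 4.0, 5.0, 6.4]}

frac_a = update_fraction(base, seed_a)
overlap = seed_overlap(base, seed_a, seed_b)
```

With real checkpoints the same comparison would run over framework tensors rather than Python lists, and the choice of `tol` matters because low-precision weight formats can hide or exaggerate very small updates.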

Sources (11 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

When does RL actually extend reasoning beyond pretraining?

A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
