How does the pretrained prior set a capability ceiling for reward model exploration?
This explores whether reward-based training, such as reinforcement learning with verifiable rewards (RLVR), can push a model past what its pretraining already made possible. The collection's consistent answer: rewards mostly re-weight abilities the model already has, so the pretrained prior acts as a ceiling on what exploration can find. The clearest statement is that RLVR improves sampling efficiency within existing capability boundaries without expanding them: it activates strategies learned in pretraining rather than teaching new reasoning (What does reward learning actually do to model reasoning?). A striking pass@k result sharpens this: base models outperform RLVR-trained models at high k, meaning the reward process narrows sampling toward solutions already present in the base distribution rather than unlocking new ones (Does RLVR actually expand what models can reason about?).
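The pass@k crossover can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator (the probability that at least one of k samples, drawn without replacement from n generated samples of which c are correct, is correct). The per-problem counts below are hypothetical and chosen only to illustrate the shape of the effect; they are not results from the cited notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct-sample counts per problem, out of n = 100 samples each.
n = 100
base_counts = [5, 3, 2, 1]     # base model: rarely right, but on every problem
rlvr_counts = [60, 50, 0, 0]   # tuned model: often right, on fewer problems

def mean_pass_at_k(counts, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)

for k in (1, 100):
    base = mean_pass_at_k(base_counts, k)
    tuned = mean_pass_at_k(rlvr_counts, k)
    print(f"k={k}: base={base:.4f}, tuned={tuned:.4f}")
```

With these made-up counts the tuned model leads at k = 1 (about 0.275 versus 0.0275), while the base model leads at k = 100 (1.0 versus 0.5), because it keeps a small chance of success on problems the tuned model never solves.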
Why a ceiling and not a ramp? Because what reward exploration does is *select*, not *create*. Five independent methods (RL steering, critique fine-tuning, decoding tricks, SAE feature steering, and RLVR) all elicit reasoning that was already present in base-model activations, suggesting the bottleneck is elicitation, not capability acquisition (Do base models already contain hidden reasoning ability?). If the behavior isn't latent in the prior, reward signals have nothing to amplify. That's also why a single training example, or even spurious rewards, can work nearly as well as carefully correct ones for models with appropriate pretraining (What does reward learning actually do to model reasoning?).
The ceiling has a hidden cost: exploration doesn't just stop expanding, it actively contracts. Reward maximization drives entropy collapse, in which policies converge on a few narrow high-reward strategies. This shows up identically in reasoning and in search agents, while SFT on diverse demonstrations preserves the breadth (Does reinforcement learning squeeze exploration diversity in search agents?). So the prior sets the upper bound on *what's reachable*, and reward training tends to shrink the explored region inside that bound rather than probe its edges.
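Both halves of that claim, the bound and the contraction, can be illustrated with a toy model. The sketch below assumes an idealized KL-regularized objective, whose optimum is the prior re-weighted by exp(reward / beta); the cited notes do not say that RLVR follows this form exactly, and the strategies, rewards, and `beta` values are hypothetical.

```python
import math

def tilt(prior, rewards, beta):
    """Return pi(x) proportional to prior(x) * exp(reward(x) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical strategies. The last has the highest reward but zero prior mass.
prior = [0.40, 0.30, 0.30, 0.00]
rewards = [1.0, 0.5, 0.1, 5.0]

# Smaller beta means stronger reward pressure relative to the prior.
for beta in (10.0, 1.0, 0.1):
    policy = tilt(prior, rewards, beta)
    print(f"beta={beta}: entropy={entropy(policy):.3f}, "
          f"mass on unseen strategy={policy[-1]}")
```

In this toy setting, stronger reward pressure lowers the entropy of the policy, yet the strategy with zero prior mass stays at zero no matter how large its reward: re-weighting can only redistribute probability the prior already assigned.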
What breaks through the ceiling? The corpus points to two levers that add genuinely new information instead of re-ranking old behavior. Distillation transfers reasoning patterns the base model didn't have (Does RLVR actually expand what models can reason about?). And natural-language feedback does what scalar rewards can't: models stuck on a numerical-reward plateau start solving problems when given chain-of-thought critiques that explain *why* they failed, information a single reward number cannot carry (Can natural language feedback overcome numerical reward plateaus?). The same theme runs through the demonstration-bound view of agents, where competence is capped by what the data curators imagined because the agent never interacts with an environment beyond the demonstrations (Can agents learn beyond what their training data shows?).
The surprising turn is that the ceiling is being attacked at the other end of the pipeline. If post-training can only elicit what pretraining planted, then plant more during pretraining: RLP treats chain-of-thought as an exploratory action *during* pretraining itself, using information gain as a verifier-free reward, and substantially lifts math and science benchmarks on Qwen3-1.7B and Nemotron-Nano-12B (Can chain-of-thought reasoning be learned during pretraining itself?). That reframes the whole problem: the capability ceiling isn't a law of reward learning; it's a consequence of where in the training pipeline you let exploration happen.
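A minimal sketch of what such a verifier-free reward could look like, assuming it is the improvement in log-likelihood of the observed next token when the model conditions on a sampled thought versus when it does not. The function name and the probabilities are hypothetical, and how RLP actually computes its no-thought baseline is not specified in these notes.

```python
import math

def info_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the observed next token.

    Positive when the thought made the true continuation more likely,
    negative when it made it less likely. No external verifier is needed:
    the training text itself supplies the target token.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)

# Hypothetical probabilities assigned to the true next token.
helpful = info_gain_reward(p_with_thought=0.40, p_without_thought=0.10)
unhelpful = info_gain_reward(p_with_thought=0.05, p_without_thought=0.10)

print(f"helpful thought reward:   {helpful:+.3f}")
print(f"unhelpful thought reward: {unhelpful:+.3f}")
```

The design point is that the signal comes from ordinary pretraining text rather than from a task-specific checker, which is what lets exploration happen before post-training begins.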
Sources (7 notes)
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.