INQUIRING LINE

How does the pretrained prior set a capability ceiling for reward model exploration?

This explores whether reward-based training (RLVR and friends) can push a model past what its pretraining already made possible — and the corpus answer is mostly no: rewards re-weight existing abilities rather than create new ones.


The collection's consistent answer: rewards mostly re-weight abilities the model already has, so the pretrained prior acts as a ceiling on what exploration can find. The clearest statement is that RLVR improves sampling efficiency within existing capability boundaries without expanding them: it activates strategies learned in pretraining rather than teaching new reasoning (see "What does reward learning actually do to model reasoning?"). A striking pass@k result sharpens this: base models outperform RLVR-trained models at high k, meaning the reward process narrows sampling toward solutions already present in the base distribution rather than unlocking new ones (see "Does RLVR actually expand what models can reason about?").
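The pass@k comparison can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are hypothetical numbers chosen only to illustrate the crossover; they are not results from any of the cited notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # comb(n - c, k) / comb(n, k) is the chance that k samples drawn
    # without replacement are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts of correct answers out of 100 samples on three problems.
N_SAMPLES = 100
BASE_CORRECT = [5, 3, 2]    # rarely right, but occasionally right on every problem
RLVR_CORRECT = [60, 40, 0]  # often right on two problems, never on the third

for k in (1, 64):
    base = mean_pass_at_k(BASE_CORRECT, N_SAMPLES, k)
    rlvr = mean_pass_at_k(RLVR_CORRECT, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

In this toy setup the tuned model leads at k=1, but it can never exceed 2/3 at any k because it has lost all coverage of the third problem, while the base model keeps climbing as k grows.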

Why a ceiling and not a ramp? Because what reward exploration does is *select*, not *create*. Five independent methods (RL steering, critique fine-tuning, decoding tricks, SAE feature steering, and RLVR) all elicit reasoning that was already present in base-model activations, suggesting the bottleneck is elicitation, not capability acquisition (see "Do base models already contain hidden reasoning ability?"). If the behavior isn't latent in the prior, reward signals have nothing to amplify. That's also why a single training example, or even spurious rewards, can work nearly as well as carefully correct ones for models with appropriate pretraining (see "What does reward learning actually do to model reasoning?").
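A toy sketch of "select, not create" follows. It assumes an idealized picture in which reward training tilts the base distribution multiplicatively, as in the closed-form solution of KL-regularized reward maximization (new probability proportional to base probability times exp(reward / beta)). The strategy names, probabilities, rewards, and `beta` value are all hypothetical.

```python
import math

def reweight(base_probs, rewards, beta=1.0):
    """Tilt a base distribution toward high reward: p(s) * exp(r(s) / beta), renormalized."""
    weights = {
        strategy: prob * math.exp(rewards.get(strategy, 0.0) / beta)
        for strategy, prob in base_probs.items()
    }
    total = sum(weights.values())
    return {strategy: weight / total for strategy, weight in weights.items()}

# Hypothetical base model: one strategy has exactly zero prior probability.
base = {"guess": 0.90, "algebra": 0.10, "novel_proof": 0.0}
# Hypothetical rewards: the missing strategy would be rewarded if it ever appeared.
rewards = {"guess": 0.0, "algebra": 1.0, "novel_proof": 1.0}

tuned = reweight(base, rewards, beta=0.2)
for strategy, prob in tuned.items():
    print(f"{strategy:<12} base={base[strategy]:.2f}  tuned={prob:.3f}")
```

The rewarded strategy that the prior already supports is amplified sharply, while the equally rewarded strategy with no prior mass stays at zero: multiplication cannot create support that was never there.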

The ceiling has a hidden cost: exploration doesn't just stop expanding, it actively contracts. Reward maximization drives entropy collapse, in which policies converge on a few narrow high-reward strategies. This shows up identically in reasoning and in search agents, while SFT on diverse demonstrations preserves the breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So the prior sets the upper bound on *what's reachable*, and reward training tends to shrink the explored region inside that bound rather than probe its edges.
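Entropy collapse can be illustrated with a minimal sketch, assuming a policy that starts uniform over four hypothetical strategies and is tilted toward reward with increasing strength. The reward values and strengths are made up for illustration; Shannon entropy over the strategy distribution stands in for behavioral diversity.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def tilt_uniform(rewards, strength):
    """Start from a uniform policy and weight each strategy by exp(strength * reward)."""
    weights = [math.exp(strength * r) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical rewards for four strategies, and increasing optimization pressure.
REWARDS = [1.0, 0.8, 0.3, 0.1]
STRENGTHS = [0.0, 2.0, 4.0, 8.0]
ENTROPIES = [entropy(tilt_uniform(REWARDS, s)) for s in STRENGTHS]

for strength, h in zip(STRENGTHS, ENTROPIES):
    print(f"strength={strength:>4}  entropy={h:.3f} nats")
```

Entropy starts at log(4) for the uniform policy and falls as the reward pressure increases, which is the contraction the paragraph above describes.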

What breaks through the ceiling? The corpus points to two levers that add genuinely new information instead of re-ranking old behavior. Distillation transfers reasoning patterns the base model didn't have (see "Does RLVR actually expand what models can reason about?"). And natural-language feedback does what scalar rewards can't: models stuck on a numerical-reward plateau start solving when given chain-of-thought critiques that explain *why* they failed, which is information a single reward number cannot carry (see "Can natural language feedback overcome numerical reward plateaus?"). The same theme runs through the demonstration-bound view of agents, where competence is capped by what the data curators imagined because the agent never interacts with an environment beyond the demonstrations (see "Can agents learn beyond what their training data shows?").

The surprising turn is that the ceiling is being attacked at the other end of the pipeline. If post-training can only elicit what pretraining planted, then plant more during pretraining: RLP treats chain-of-thought as an exploratory action *during* pretraining itself, using information gain as a verifier-free reward, and lifts reasoning benchmarks substantially (see "Can chain-of-thought reasoning be learned during pretraining itself?"). That reframes the whole problem: the capability ceiling isn't a law of reward learning, it's a consequence of where in the training pipeline you let exploration happen.
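A minimal sketch of a verifier-free, information-gain style reward follows, based only on the source note's description of the reward as a log-likelihood improvement. It assumes the reward is the gain in log-probability of the observed next token when the model conditions on a sampled thought versus when it does not. The function name and the probabilities are hypothetical, and this is not claimed to match the RLP paper's implementation.

```python
import math

def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the observed next token from conditioning on a thought."""
    return math.log(p_with_thought) - math.log(p_without_thought)

# Hypothetical next-token probabilities for the same context.
helpful = information_gain_reward(p_with_thought=0.6, p_without_thought=0.3)
harmful = information_gain_reward(p_with_thought=0.1, p_without_thought=0.3)

print(f"helpful thought reward: {helpful:+.3f}")
print(f"harmful thought reward: {harmful:+.3f}")
```

No external verifier is needed in this framing: a thought earns positive reward only when it makes the actual continuation more predictable, and negative reward when it makes it less so.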


Sources (7 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher evaluating whether pretrained priors truly set an absolute ceiling on what reward-based exploration can achieve in LLMs. The question remains open: *Can post-training break through capability ceilings, or does it only re-weight latent abilities?*

What a curated library found — and when (spanning Sept 2024–May 2026, but treat as dated claims, not current truth):
• RLVR improves sampling efficiency within existing capability boundaries; base models outperform RLVR at high k, suggesting rewards select rather than create (2025-04).
• Five independent methods (RL steering, critique fine-tuning, decoding, SAE steering, RLVR) all elicit reasoning already latent in base-model activations, implying the bottleneck is elicitation, not acquisition (2025-06).
• Reward maximization drives entropy collapse—policies converge on narrow strategies—while SFT on diverse demonstrations preserves breadth (2025-07).
• Natural-language feedback breaks plateaus where scalar rewards stall; chain-of-thought critiques carry information that a single reward number cannot (2025-06).
• RLP treats chain-of-thought as exploratory action *during* pretraining using information gain as verifier-free reward, substantially lifting reasoning benchmarks (2025-10).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (Apr 2025): Does RL Really Incentivize Reasoning Beyond Base
• arXiv:2506.03106 (Jun 2025): Critique-GRPO: Natural Language and Numerical Feedback
• arXiv:2507.14843 (Jul 2025): The Invisible Leash: Why RLVR May Not Escape Its Origin
• arXiv:2510.01265 (Oct 2025): RLP: Reinforcement as Pretraining Objective

Your task:
(1) RE-TEST THE CEILING CLAIM. For each finding above, judge whether newer models (o3, o4-class), improved reward signals (process-based, hierarchical), in-context RL, or scaffold-based training (e.g., tree search during inference, dynamic task decomposition) have since relaxed the elicitation-only bottleneck. Separate the durable question (can post-training add fundamentally new reasoning?) from the perishable limitation (current methods only re-rank). Cite what resolved each, and plainly flag what still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Does any recent paper show reward methods genuinely *expanding* the capability frontier rather than selecting within it? What mechanism?

(3) Propose 2 research questions that assume the regime may have shifted: e.g., *Can iterative reward refinement (reward learning on reward learning) break the single-reward bottleneck?* *Does multi-modality (vision + reasoning) alter what the prior contains and thus what rewards can access?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
