INQUIRING LINE

Does RL training actually restore the critical thinking that reasoning models lose?

This reads the question as asking whether RL post-training genuinely improves a reasoning model's thinking quality, or whether extended 'reasoning' degrades judgment in ways RL only partly papers over. The corpus suggests the honest answer is that RL redirects and deploys thinking that's already there, rather than restoring something lost.


The corpus reframes the question before answering it. The dominant finding isn't that models *lose* critical thinking and RL *restores* it; it's that RL mostly decides *when* to think rather than teaching the model *how*. Several independent lines converge here: base models already carry reasoning strategies in latent form, and minimal training elicits them rather than building them (Do base models already contain hidden reasoning ability?, Does RL teach reasoning or just when to use it?). One hybrid setup, which pairs a base model's reasoning with steering from a thinking model, recovers 91% of the performance gain through token routing alone, using only 12% of tokens. That is strong evidence that RL is acting as a deployment optimizer, not a capability creator (Does RL post-training create reasoning or just deploy it?).

But the premise hiding in your question — that reasoning *hurts* — turns out to be real, and that's where RL does something closer to 'restoring.' Vanilla models, when told to think longer, often talk themselves into self-doubt that actively degrades their answers. RL reverses the sign: the same extended-thinking mechanism flips from counterproductive second-guessing into productive gap analysis (Does extended thinking help or hurt model reasoning?). So RL isn't recovering lost critical thinking so much as rehabilitating a mechanism that was misfiring — training mediates the *quality* of reasoning, not just the amount.

The catch is that 'better thinking' and 'better scores' can come apart. RL on theory-of-mind tasks shows scale-dependent collapse: 7B models develop explicit, transferable belief-tracking, while smaller ones reach comparable accuracy through shortcut learning with no interpretable reasoning underneath. That gap is invisible unless you read the step-by-step traces (Does reinforcement learning on theory of mind collapse with model scale?). Similarly, RLVR has been shown to sharpen sampling within a model's existing boundaries rather than push past them: a single training example can trigger the gains, and for models with appropriate pretraining, spurious rewards work nearly as well as correct ones, which is hard to square with 'teaching new reasoning' (What does reward learning actually do to model reasoning?). On this view, RL polishes what's already latent and risks rewarding the appearance of thought.

There's a genuine counter-current worth knowing about, though. *Prolonged* RL — with KL control, policy resetting, and training on non-mathematical tasks — outperforms base models across every pass@k level and discovers strategies the base model simply doesn't contain, especially in domains where there's no established pattern to elicit (Can reinforcement learning discover reasoning strategies base models cannot?). And rather than waiting for post-training to repair reasoning, some work plants chain-of-thought during pretraining itself with information-gain rewards, lifting reasoning before any 'loss' can occur (Can chain-of-thought reasoning be learned during pretraining itself?).

So the sharper takeaway: RL doesn't restore lost critical thinking — it redeploys, redirects, and occasionally extends thinking. If you want to go deeper on the mechanics, RL training follows a predictable two-phase arc where execution mastery comes first and strategic planning becomes the later bottleneck (Does RL training follow a predictable two-phase learning sequence?), and you can even reward metacognition directly — tagging planning, exploration, and reflection — to teach efficient reasoning rather than just correct answers (Can RL agents learn to reason better, not just succeed?).


Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
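
For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled attempts solves a problem. The note above does not say how it was estimated. The sketch below uses the widely used unbiased estimator computed from n samples of which c are correct, which is an assumption about the evaluation rather than a detail from the source. The function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    draw of k samples (without replacement) contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so every draw has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for k in (1, 8, 64):
        print(k, round(pass_at_k(n=64, c=4, k=k), 4))
```

Comparing models at both small and large k is what gives the claim its force: a gain only at small k would be consistent with sharper sampling, whereas a gain that persists at large k is what the note treats as evidence of expanded capability.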

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
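
The reward described above can be sketched as the change in log-likelihood of the observed next token once the model conditions on its own chain-of-thought. This is a minimal illustration of that idea only. The function name, the per-token inputs, and the absence of any baseline or normalization are assumptions, not details of RLP.

```python
from typing import List, Sequence


def log_likelihood_gain(
    logp_with_cot: Sequence[float],
    logp_without_cot: Sequence[float],
) -> List[float]:
    """Per-token reward: log p(token | context, CoT) - log p(token | context).

    Both inputs are hypothetical log-probabilities of the same observed
    tokens, scored with and without a sampled chain-of-thought. A positive
    value means the thought made the true token more likely.
    """
    if len(logp_with_cot) != len(logp_without_cot):
        raise ValueError("inputs must score the same tokens")
    return [w - wo for w, wo in zip(logp_with_cot, logp_without_cot)]


if __name__ == "__main__":
    # Made-up log-probabilities for two tokens.
    print(log_likelihood_gain([-1.0, -0.5], [-2.5, -0.5]))
```

Because the signal comes from the model's own likelihoods on ordinary text, no external verifier or answer key is needed, which is what makes it usable during pretraining.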

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
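
Token entropy here presumably means the Shannon entropy of the model's next-token distribution at a given position. The sketch below computes it and averages it separately per token category. How tokens are labeled as planning or execution is not specified in the note, so the `tags` input and both function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_tag(
    distributions: Sequence[Sequence[float]],
    tags: Sequence[str],
) -> Dict[str, float]:
    """Average entropy per tag, e.g. 'planning' versus 'execution'."""
    if len(distributions) != len(tags):
        raise ValueError("need exactly one tag per distribution")
    grouped: Dict[str, List[float]] = defaultdict(list)
    for probs, tag in zip(distributions, tags):
        grouped[tag].append(token_entropy(probs))
    return {tag: sum(vals) / len(vals) for tag, vals in grouped.items()}


if __name__ == "__main__":
    # Toy distributions over a two-token vocabulary.
    dists = [[0.5, 0.5], [1.0, 0.0]]
    print(mean_entropy_by_tag(dists, ["planning", "execution"]))
```

Tracking these two averages over training steps is one way to look for the pattern the note describes: planning entropy rising while execution entropy levels off.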

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-model researcher re-testing claims about RL's role in critical thinking. The core question remains: does RL training restore reasoning capacity that models lose, or does it do something else entirely?

What a curated library found — and when (dated claims, not current truth):
Findings span April 2025–December 2025. A library of recent work converges on a surprising reframe:

• Base models already contain latent reasoning strategies; RL mostly teaches *when* to activate them, not *how* (arXiv:2510.07364, Oct 2025). Token-routing alone recovers 91% of RL gains, suggesting deployment optimization over capability creation.
• RL transforms extended thinking from counterproductive self-doubt into productive gap analysis — mediating reasoning *quality*, not just quantity (arXiv:2505.21493, May 2025).
• Scale-dependent reasoning collapse on theory-of-mind tasks: larger models develop genuine belief-tracking; smaller ones shortcut to accuracy without real reasoning, invisible in aggregate scores (arXiv:2504.01698, Apr 2025).
• RLVR sharpens sampling within existing boundaries; spurious rewards can trigger gains, raising questions about whether RL teaches or polishes (arXiv:2507.14843, Jul 2025).
• Prolonged RL with KL control discovers genuinely novel strategies inaccessible to base models, especially off-pattern domains (arXiv:2505.24864, May 2025).

Anchor papers (verify; mind their dates):
- arXiv:2510.07364 (Base Models Know How to Reason, Oct 2025)
- arXiv:2505.24864 (ProRL, May 2025)
- arXiv:2504.01698 (Theory of Mind reasoning collapse, Apr 2025)
- arXiv:2507.22844 (Meta-reasoning rewards, Jul 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For "RL teaches *when*, not *how*": has work since Dec 2025 shown models discovering *genuinely* novel reasoning steps, or does the 91% token-routing result still hold? Does scale collapse on ToM persist with newer checkpoints? Separate the durable question (what *actually* changes in reasoning under RL?) from the perishable claim (RL is purely deployment optimization). Cite what overturned or confirmed each.
(2) Surface the strongest *contradiction* from the last 6 months: prolonged RL discovers novel strategies vs. RLVR's evidence that RL polishes without expanding. Which framing dominates newer work? Where do they coexist?
(3) Propose 2 research questions assuming the regime may have shifted: (a) Can we now distinguish elicitation from capability expansion with fidelity, and if so, what fraction of post-training gains are which? (b) Does pretraining-time chain-of-thought (arXiv:2510.01265) obviate the need for RL-mediated reasoning repair, or do they compound?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
