INQUIRING LINE

Can models learn both what and how to study through reinforcement learning?

This explores whether reinforcement learning can teach a model two different things at once — the content it reasons over (the 'what') and the strategy for deploying that reasoning (the 'how') — rather than just one or the other.


The most direct evidence that these are genuinely separate things comes from work showing that RL training splits into two phases: an early phase where the bottleneck is execution correctness (getting steps right) and a later phase where the bottleneck shifts to strategic planning, with planning-token entropy rising while execution entropy stabilizes (Does RL training follow a predictable two-phase learning sequence?). In other words, 'how to execute' and 'what to plan' aren't learned simultaneously; the model consolidates procedure first, then learns where to spend its reasoning. That ordering suggests the two are learnable, but on different clocks.
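The entropy comparison behind the two-phase claim can be sketched in a few lines. This is a minimal illustration, not the cited work's method: it assumes each generated token already carries a hypothetical role label ("planning" or "execution") and that the full next-token distribution at that position is available. How tokens are actually labeled is not described in the notes here.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(
    dists: List[Sequence[float]], roles: List[str]
) -> Dict[str, float]:
    """Average token entropy separately for each role label.

    `roles` is a hypothetical per-token tag such as "planning" or
    "execution"; the tagging scheme itself is an assumption.
    """
    if len(dists) != len(roles):
        raise ValueError("need exactly one role label per distribution")
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for probs, role in zip(dists, roles):
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}
```

Tracking these two averages across training checkpoints is one way to look for the reported pattern: a planning average that climbs while the execution average flattens.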

The harder question is whether RL teaches genuinely *new* content at all, or just reorganizes what's already there. A skeptical thread argues RL mostly sharpens existing ability: reward learning activates pretraining strategies rather than installing new ones, a single example can trigger the effect, and even spurious rewards work nearly as well as correct ones (What does reward learning actually do to model reasoning?). Pass@k analysis sharpens the point: base models actually beat RLVR models at high sampling budgets (large k), implying RL narrows the search toward solutions already in the distribution rather than expanding the boundary of what's solvable; distillation, by contrast, is what transfers genuinely new reasoning patterns (Does RLVR actually expand what models can reason about?). On this view, RL teaches 'how to deploy' far more readily than 'what's newly knowable.'
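To make the pass@k argument concrete, here is a small sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are invented purely to illustrate how a crossover can arise; they are not results from any paper.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 100 samples on four problems.
# "Base" rarely succeeds but covers every problem; "RL-tuned" is
# reliable on two problems and never solves the other two.
BASE_COUNTS = [1, 1, 1, 1]
RL_COUNTS = [60, 60, 0, 0]

if __name__ == "__main__":
    for k in (1, 64):
        base = mean_pass_at_k(BASE_COUNTS, 100, k)
        tuned = mean_pass_at_k(RL_COUNTS, 100, k)
        print(f"k={k}: base={base:.2f} rl_tuned={tuned:.2f}")
```

With these made-up counts, the RL-tuned model wins at k = 1 (about 0.30 versus 0.01) and the base model wins at k = 64 (about 0.64 versus 0.50). That is the shape of the argument: sharper sampling on problems already within reach, with no gain in coverage.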

But other corners of the corpus push back, suggesting RL *can* embed content when the reward is shaped to demand it. RLAG rewards both answer accuracy and the rationality of the explanation, cycling between augmented and unaugmented generation to progressively internalize coherent knowledge structures, and it beats supervised fine-tuning precisely because it optimizes reasoning quality over token-level matching (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). Similarly, sophisticated domain reasoning has been shown to emerge from RL on hard problems with nothing but basic accuracy signals, no teacher-distilled chains required (Can simple rewards alone teach complex domain reasoning?). The disagreement here is real and worth sitting with: it may hinge on whether the 'new' capability was latent in pretraining or not.

A quieter set of findings reframes the question as *which signal you reward determines which of 'what' and 'how' the model picks up.* SkillRL treats successful episodes as concrete demonstrations (the 'what') and failures as abstracted lessons (the 'how to avoid'), and this asymmetric processing, which mirrors how human experts learn, outperforms treating all trajectories uniformly (Should successful and failed episodes be processed differently?). Decomposition methods go further: breaking instruction-following into verifiable checklist sub-criteria lets RL grade subjective quality (Can breaking down instructions into checklists improve AI reward signals?), and breaking question quality into attributes like clarity and specificity teaches models to ask better clarifying questions (Can models learn to ask genuinely useful clarifying questions?). The lesson across both: 'how to study' is teachable, but only when you decompose the skill finely enough that a reward can point at it.
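The decomposition idea can be sketched as a reward that is just a weighted fraction of checklist items passed. This is a toy illustration under stated assumptions: the item names, weights, and keyword-style checks are hypothetical stand-ins, and the cited methods may grade items quite differently (for example, with a judge model rather than string predicates).

```python
from typing import Callable, Dict, List, Tuple

# One checklist item: (name, weight, predicate over the response text).
Check = Tuple[str, float, Callable[[str], bool]]


def checklist_reward(
    response: str, checklist: List[Check]
) -> Tuple[float, Dict[str, bool]]:
    """Return a reward in [0, 1] plus the per-item pass/fail breakdown."""
    total_weight = sum(weight for _, weight, _ in checklist)
    if total_weight <= 0:
        raise ValueError("checklist needs positive total weight")
    results = {name: bool(check(response)) for name, _, check in checklist}
    passed = sum(weight for name, weight, _ in checklist if results[name])
    return passed / total_weight, results


# Hypothetical checklist for a clinical clarifying question.
EXAMPLE_CHECKLIST: List[Check] = [
    ("under_50_words", 1.0, lambda r: len(r.split()) <= 50),
    ("mentions_dosage", 2.0, lambda r: "dosage" in r.lower()),
    ("ends_with_question", 1.0, lambda r: r.strip().endswith("?")),
]

if __name__ == "__main__":
    score, detail = checklist_reward(
        "What dosage are you currently taking?", EXAMPLE_CHECKLIST
    )
    print(score, detail)
```

The per-item breakdown is the point of the design: a single holistic score cannot say which part of the skill failed, whereas each checklist entry gives the reward something specific to point at.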

The unexpected turn is that 'how to study' can become fully internal. Post-Completion Learning trains a model to compute its *own* reward in the unused sequence space after its output, internalizing self-evaluation rather than leaning on an external reward model, at zero inference cost (Can models learn to evaluate their own work during training?). And reasoning itself can be planted earlier than fine-tuning: treating chain-of-thought as an exploratory action *during pretraining*, rewarded by information gain, lifts reasoning benchmarks well before any task-specific RL (Can chain-of-thought reasoning be learned during pretraining itself?). So the honest answer is yes, models can learn both, but 'what' and 'how' respond to different reward shapes, on different schedules, and the most interesting frontier is models that learn how to evaluate their own learning.
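The information-gain reward can be read as a log-likelihood improvement: how much more probable the observed continuation becomes once the model has produced a thought. The sketch below assumes per-token log-probabilities are already available from two scoring passes (with and without the thought). The function name and the choice of a no-thought baseline are illustrative assumptions, not details taken from the RLP work.

```python
import math
from typing import Sequence


def information_gain_reward(
    logp_with_thought: Sequence[float],
    logp_without_thought: Sequence[float],
) -> float:
    """Summed log-likelihood improvement over the observed tokens.

    Positive when conditioning on the thought makes the real
    continuation more likely, negative when the thought hurts.
    """
    if len(logp_with_thought) != len(logp_without_thought):
        raise ValueError("both passes must score the same tokens")
    return sum(
        with_t - without_t
        for with_t, without_t in zip(logp_with_thought, logp_without_thought)
    )


if __name__ == "__main__":
    # Hypothetical probabilities for one observed next token.
    helpful = information_gain_reward([math.log(0.6)], [math.log(0.3)])
    harmful = information_gain_reward([math.log(0.2)], [math.log(0.4)])
    print(helpful > 0, harmful < 0)
```

Because the signal comes from the training text itself, no external verifier or labeled answer is needed, which is what makes a reward of this kind usable during pretraining.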


Sources (10 notes)

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst. The question: **Can models learn both what (content/knowledge) and how (strategy/reasoning process) to study through reinforcement learning?** This remains open—the regime may have shifted since mid-2025.

**What a curated library found — and when (findings span 2024 through October 2025; treat as dated claims):**
- RL training exhibits a two-phase dynamic: early phase locks down execution correctness, later phase shifts to strategic planning (planning-token entropy rises). Suggests 'how' and 'what' learn on different clocks (~2024–25).
- RLVR (RL with verifiable rewards) may only sharpen existing capability; pass@k analysis shows base models outperform RLVR-trained models at high sampling budgets, implying RL narrows search rather than expanding reasoning boundaries. Distillation transfers genuinely new patterns (~2025).
- **Counter-evidence:** RL with shaped rewards (e.g., RLAG rewarding explanation rationality, or RL on hard problems) embeds new domain reasoning without distillation; this hinges on whether capability was latent in pretraining (~2025).
- Decomposed reward signals (checklist-based grading, attribute-level breakdowns) teach 'how to study' by making skills granular enough for a reward to target (~2025).
- Post-Completion Learning internalizes self-evaluation in unused sequence space post-EOS, removing external reward cost; chain-of-thought as pretraining exploratory action (info-gain reward) lifts reasoning before task-specific RL (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2504.13837 (Does RLVR Expand Reasoning, 2025-04)
- arXiv:2507.18624 (Checklists vs. Reward Models, 2025-07)
- arXiv:2507.20252 (Post-Completion Learning, 2025-07)
- arXiv:2509.20162 (RLAG: Domain Knowledge via RL Augmented Generation, 2025-09)

**Your task:**
(1) **RE-TEST THE TWIN CONSTRAINTS:** Does newer data (Oct–Dec 2025) show that (a) base-model superiority at high pass@k still holds against improved RL methods, and (b) the two-phase dynamic (execution→planning) still appears or has been flattened by better curricula? Cite what resolved or preserved each.
(2) **Surface the strongest contradicting work from the last 6 months:** Which recent papers most directly challenge the 'RLVR-only-sharpens' skeptical thread, or the 'latent capability' hypothesis (that apparent gains were already latent in pretraining)?
(3) **Propose two research directions that assume the regime moved:** (a) If decomposed/internal rewards have made RL capable of seeding new content, what does that mean for the latent-vs.-emergent distinction in pretraining? (b) If reasoning-as-exploration during pretraining works, does end-to-end RL during pretraining (not finetuning) flip the 'what vs. how' separation entirely?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
