What actually changes inside a model during RL training? · Gravity7

What RL Modifies

5 notes

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
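
One way to make the sparsity question concrete is to compare a checkpoint before and after RL and count how many parameters moved. The sketch below is a minimal illustration under stated assumptions: the function name `update_sparsity`, the tolerance `tol`, and the toy weight lists are hypothetical and not taken from the note.

```python
def update_sparsity(before, after, tol=0.0):
    """Return the fraction of parameters whose value changed by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy flattened parameter vectors (hypothetical values, not real model weights).
base = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
tuned = [0.10, -0.20, 0.35, 0.40, -0.50, 0.60, 0.65, -0.80, 0.90, 1.00]

# Two of the ten entries differ, so the update touches 20% of the parameters.
fraction = update_sparsity(base, tuned)
```

In this toy case `fraction` falls inside the 5–30% range the note asks about; on a real model the same count would run over every weight tensor.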

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Can reinforcement learning discover reasoning strategies base models cannot?

Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.

Does reinforcement learning create new reasoning abilities or activate existing ones?

RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.

Does RL training follow a predictable two-phase learning sequence?

This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

RL Formalization and Architecture

5 notes

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Training Dynamics

4 notes

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
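
A budget that starts generous and tightens over training can be sketched as a simple schedule. The linear shape, the function name `token_budget`, and the budget values below are assumptions chosen for illustration; the note does not specify how the budget is tightened.

```python
def token_budget(step, total_steps, start_budget=8192, final_budget=2048):
    """Linearly tighten the token budget from start_budget to final_budget."""
    if total_steps <= 1:
        return final_budget
    # Clamp the step so out-of-range values still give a valid budget.
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    return round(start_budget + progress * (final_budget - start_budget))


# Budgets for a hypothetical five-stage run.
schedule = [token_budget(s, 5) for s in range(5)]
```

A fixed-budget baseline would correspond to setting `start_budget` equal to `final_budget`.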

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Novel Reward Paradigms

3 notes

Can models learn what makes research worth doing?

Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste'—the judgment of what work matters—is a learnable skill separate from execution.

Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Do unimodal reward models actually serve all user preferences?

Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?

Verifier-Free RL

4 notes

Can language models replace reward models with internal signals?

Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?

Can models learn to judge themselves without external rewards?

Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.

Can an agent's own beliefs guide credit assignment without critics?

Explores whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
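
One possible reading of a belief-shift signal is that each step is rewarded by how much it moves the agent's own probability of the correct answer. The sketch below assumes that reading; `belief_shift_rewards` and the probability values are hypothetical.

```python
def belief_shift_rewards(probs_correct):
    """Per-step reward = change in the agent's probability of the correct answer."""
    return [later - earlier
            for earlier, later in zip(probs_correct, probs_correct[1:])]


# Hypothetical belief trajectory over a five-state episode.
beliefs = [0.10, 0.25, 0.20, 0.60, 0.90]
rewards = belief_shift_rewards(beliefs)

# The per-step rewards telescope: their sum equals the total belief gain.
total_gain = sum(rewards)
```

Under this reading, a step that lowers the belief (the second transition here) receives a negative reward without any external critic being consulted.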

Can environment feedback replace scalar rewards in policy learning?

Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.

LLM-as-Reward-Engineer

1 note

Can LLMs design reward functions for reinforcement learning?

Can language models help automate the notoriously difficult task of designing reward shaping functions for sparse-reward RL, and if so, how might we structure that collaboration to work around LLMs' weaknesses in stochastic control?

Continual Agent Adaptation Architectures (added 2026-05-18 from Arxiv/Agents Multi Architecture.md)

2 notes

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Process Rewards and Judges

2 notes

Can judges that reason about reasoning outperform classifier rewards?

Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.

Can adversarial critics replace task-specific verifiers for reasoning?

Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.

Multi-Turn and Sequential RL

3 notes

Can full episode rewards per step enable better credit assignment?

Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
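
The contrast between the two credit schemes can be sketched for an episode with a single terminal reward. Both function names and the discount factor `gamma` are assumptions for illustration, not the note's implementation.

```python
def discounted_credit(episode_reward, num_steps, gamma=0.9):
    """Credit shrinks with distance from the final step (conventional discounting)."""
    return [episode_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


def full_episode_credit(episode_reward, num_steps):
    """Every step receives the full episode reward, regardless of position."""
    return [episode_reward] * num_steps


# Three-step episode with terminal reward 1.0 and a deliberately harsh discount.
discounted = discounted_credit(1.0, 3, gamma=0.5)
uniform = full_episode_credit(1.0, 3)
```

With discounting, early steps in a long trajectory receive a vanishing share of the signal; the uniform scheme keeps it constant at every step.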

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

How does treating LLMs as multi-step agents change what we can optimize?

Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.

Scaling and Methodology

2 notes

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2–5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.
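
A saturating sigmoid in training compute gives a rough picture of what forecasting a plateau means. The parameterization below is a generic one chosen for illustration; it is an assumption, not necessarily the form ScaleRL fits, and all parameter names are hypothetical.

```python
def sigmoid_performance(compute, ceiling, midpoint, steepness, floor=0.0):
    """Saturating curve: performance rises from floor toward ceiling as compute grows."""
    if compute <= 0:
        raise ValueError("compute must be positive")
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** steepness)


# At compute == midpoint the curve sits halfway between floor and ceiling.
halfway = sigmoid_performance(100.0, ceiling=0.8, midpoint=100.0, steepness=2.0)

# With far more compute, performance approaches but never exceeds the ceiling.
near_ceiling = sigmoid_performance(1e6, ceiling=0.8, midpoint=100.0, steepness=2.0)
```

Fitting `ceiling`, `midpoint`, and `steepness` on early training points would then yield a predicted plateau before the full run is paid for.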

Alignment and Personalization

2 notes

Can text summaries beat embeddings for personalized reward models?

When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Fine-Tuning Side Effects

3 notes

Does fine-tuning disconnect reasoning steps from final answers?

When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.

Do pretraining and fine-tuning scale independently in language models?

Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.

Parameter-Efficient and Alternative Tuning

5 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Can context playbooks prevent knowledge loss during iteration?

When AI systems iteratively refine their instructions and memories, do structured incremental updates better preserve domain knowledge than traditional rewriting? This matters because context degradation undermines long-term agent performance.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.

Data Selection and Reasoning Architecture

2 notes

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Core Insights

2 notes

How should multiple reward objectives be weighted during training?

When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
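
The intuition that within-rollout variance marks which objectives carry signal can be sketched by weighting each objective in proportion to the variance of its rewards across a rollout group. The normalization rule, the name `variance_weights`, and the sample rewards are assumptions for illustration.

```python
from statistics import pvariance


def variance_weights(rewards_by_objective):
    """Weight each objective by its share of total reward variance across rollouts."""
    variances = {name: pvariance(values)
                 for name, values in rewards_by_objective.items()}
    total = sum(variances.values())
    if total == 0:
        # No objective separates the rollouts; fall back to equal weights.
        return {name: 1.0 / len(variances) for name in variances}
    return {name: v / total for name, v in variances.items()}


# Hypothetical rewards for four rollouts of one prompt.
rollout_rewards = {
    "correctness": [1.0, 0.0, 1.0, 0.0],  # rollouts disagree: informative
    "format": [1.0, 1.0, 1.0, 1.0],       # every rollout passes: no signal
}
weights = variance_weights(rollout_rewards)
```

Here the saturated `format` objective receives zero weight, because it cannot distinguish a better rollout from a worse one.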

Can reward vectors be the hidden source of solution diversity?

Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?
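
A small example shows what scalarization discards. The two reward dimensions, the equal weights, and the name `scalarize` are hypothetical; the point is only that distinct reward vectors can map to the same scalar.

```python
def scalarize(reward_vector, weights):
    """Collapse a multi-dimensional reward into one number by a weighted sum."""
    return sum(r * w for r, w in zip(reward_vector, weights))


# Two hypothetical solutions scored on (conciseness, thoroughness).
concise_solution = (1.0, 0.0)
thorough_solution = (0.0, 1.0)
weights = (0.5, 0.5)

concise_scalar = scalarize(concise_solution, weights)
thorough_scalar = scalarize(thorough_solution, weights)
```

Both solutions score 0.5 after collapsing, so a scalar-only learner cannot tell these two specializations apart even though their reward vectors point in different directions.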

DRO and Self-Supervised Dense Signals (2026-05-18)

3 notes

Can we identify which tokens actually matter for reasoning?

Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?

Can rubrics and dense rewards work together without hacking?

Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.

Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.

Related Areas

7 notes

What does reward learning actually do to model reasoning?

Explores whether RLVR expands reasoning capabilities or merely activates latent skills. Investigates the mechanism by which rewards reshape model outputs and whether this constitutes genuine learning or efficient sampling.

How well do reward models actually evaluate AI reasoning?

Reward models are central to training better AI systems, but do they truly assess reasoning quality or do they rely on shortcuts? This explores whether these evaluators work as intended.

How should reasoning systems actually be architected?

This explores the fundamental design choices for building reasoning into AI systems—from when to activate reasoning versus how to execute it, to whether reasoning must be verbal or can happen in latent space.

How does RL training reshape reasoning and what gets lost?

Explores how reinforcement learning modifies model capabilities during training, what verifiable rewards actually accomplish, and what side effects emerge in the process. Understanding these mechanisms matters for building reliable AI systems.

How should we allocate compute budget at inference time?

Test-time scaling explores how to spend computational resources at query time rather than during training. The core challenge: given a fixed inference budget, what's the optimal allocation strategy for different problems?

How should researchers navigate LLM reasoning research?

This note describes how to systematically navigate interconnected insights about test-time scaling, reasoning architectures, and language model cognition. It matters because LLM research spans multiple domains, from inference compute to philosophy, and understanding the map helps identify novel connections.