What actually changes inside a model during RL training?

How RL training mechanically reshapes model parameters, dynamics, and reasoning strategies through sparse updates and suppression.

Topic Hub · 55 linked notes · 17 sections

What RL Modifies

5 notes

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
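
One way to make the "fraction of parameters modified" question concrete is to diff two checkpoints and count the entries that moved. The sketch below is only an illustration: the function name `update_sparsity`, the parameter names, and the toy weights are all hypothetical, and real checkpoints would be tensors rather than Python lists.

```python
def update_sparsity(before, after, tol=0.0):
    """Return the fraction of scalar parameters that changed by more than tol."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(new - old) > tol:
                changed += 1
    return changed / total if total else 0.0


# Toy checkpoints (hypothetical names and values): 2 of 10 weights differ.
base = {
    "layer0.w": [0.10, -0.20, 0.30, 0.40],
    "layer1.w": [0.50, 0.60, -0.70, 0.80, 0.90, 1.00],
}
tuned = {
    "layer0.w": [0.10, -0.20, 0.31, 0.40],
    "layer1.w": [0.50, 0.62, -0.70, 0.80, 0.90, 1.00],
}

fraction = update_sparsity(base, tuned)
print(f"{fraction:.0%} of parameters changed")
```

The `tol` argument is an assumption of this sketch: with a tolerance of zero any numerical change counts, while a small positive tolerance would ignore changes below that threshold.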

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Can reinforcement learning discover reasoning strategies base models cannot?

Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.

Does reinforcement learning create new reasoning abilities or activate existing ones?

RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.

Does RL training follow a predictable two-phase learning sequence?

This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

RL Formalization and Architecture

5 notes

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Training Dynamics

4 notes

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Novel Reward Paradigms

3 notes

Can models learn what makes research worth doing?

Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste'—the judgment of what work matters—is a learnable skill separate from execution.

Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Do unimodal reward models actually serve all user preferences?

Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?

Verifier-Free RL

4 notes

Can language models replace reward models with internal signals?

Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?

Can models learn to judge themselves without external rewards?

Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.

Can an agent's own beliefs guide credit assignment without critics?

Explores whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.

Can environment feedback replace scalar rewards in policy learning?

Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.

LLM-as-Reward-Engineer

1 note

Can LLMs design reward functions for reinforcement learning?

Can language models help automate the notoriously difficult task of designing reward shaping functions for sparse-reward RL, and if so, how might we structure that collaboration to work around LLMs' weaknesses in stochastic control?

Continual Agent Adaptation Architectures (added 2026-05-18 from Arxiv/Agents Multi Architecture.md)

2 notes

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Process Rewards and Judges

2 notes

Can judges that reason about reasoning outperform classifier rewards?

Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.

Can adversarial critics replace task-specific verifiers for reasoning?

Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.

Multi-Turn and Sequential RL

3 notes

Can full episode rewards per step enable better credit assignment?

Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
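
The contrast between the two credit schemes can be seen on a toy trajectory. The sketch below is an illustration under stated assumptions, not the linked note's implementation: the function names, the discount factor `gamma = 0.5`, and the four-step reward sequence are all hypothetical. It treats "discounting by step distance" as the standard discounted return-to-go.

```python
def discounted_returns(rewards, gamma):
    """Standard return-to-go: later rewards count less the further away they are."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def full_episode_credit(rewards):
    """Give every step the same credit: the undiscounted episode total."""
    total = sum(rewards)
    return [total] * len(rewards)


# Sparse-reward toy episode: only the final step is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]

discounted = discounted_returns(rewards, gamma=0.5)
uniform = full_episode_credit(rewards)

print(discounted)  # [0.125, 0.25, 0.5, 1.0]
print(uniform)     # [1.0, 1.0, 1.0, 1.0]
```

With a sparse terminal reward, discounting shrinks the signal reaching early steps geometrically, while the full-episode scheme hands each step the same value.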

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

How does treating LLMs as multi-step agents change what we can optimize?

Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.

Scaling and Methodology

2 notes

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.

Alignment and Personalization

2 notes

Can text summaries beat embeddings for personalized reward models?

When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Fine-Tuning Side Effects

3 notes

Does fine-tuning disconnect reasoning steps from final answers?

When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.

Do pretraining and fine-tuning scale independently in language models?

Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.

Parameter-Efficient and Alternative Tuning

5 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Can context playbooks prevent knowledge loss during iteration?

When AI systems iteratively refine their instructions and memories, do structured incremental updates better preserve domain knowledge than traditional rewriting? This matters because context degradation undermines long-term agent performance.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.

Data Selection and Reasoning Architecture

2 notes

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Core Insights

2 notes

How should multiple reward objectives be weighted during training?

When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
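
As a rough illustration of the idea, one could weight each objective by how much its reward varies across rollouts of the same prompt, on the reasoning that a reward that is identical for every rollout cannot distinguish good samples from bad ones. Everything below is an assumption made for this sketch, not the linked note's method: the proportional-to-variance rule, the function name `variance_weights`, the uniform fallback, and the objective names and values.

```python
from statistics import pvariance


def variance_weights(rollout_rewards):
    """Weight each objective in proportion to its reward variance across rollouts."""
    variances = {name: pvariance(values) for name, values in rollout_rewards.items()}
    total = sum(variances.values())
    if total == 0:
        # No objective separates the rollouts; fall back to uniform weights.
        return {name: 1.0 / len(variances) for name in variances}
    return {name: v / total for name, v in variances.items()}


# Four rollouts for one prompt, scored on three hypothetical objectives.
rewards = {
    "correctness": [1.0, 0.0, 1.0, 0.0],  # varies a lot across rollouts
    "format": [1.0, 1.0, 1.0, 1.0],       # saturated: identical for every rollout
    "length": [0.4, 0.6, 0.4, 0.6],       # varies a little
}

weights = variance_weights(rewards)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

In this toy case the saturated `format` objective receives zero weight, and most of the weight goes to `correctness`, the objective whose reward differs most between rollouts.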

Can reward vectors be the hidden source of solution diversity?

Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?

DRO and Self-Supervised Dense Signals (2026-05-18)

3 notes

Can we identify which tokens actually matter for reasoning?

Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?

Can rubrics and dense rewards work together without hacking?

Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.

Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.
