How does modularity in reward and policy design enable goal generalization?
This explores whether breaking reward signals and policies into separate, recombinable parts — rather than one monolithic scalar or one fixed policy — is what lets systems carry skills to new tasks they weren't trained on.
This reads the question as one about composition: when reward and policy are built from separable parts instead of a single lumped objective, those parts can be recombined for situations training never covered. The corpus makes a consistent case for this from several angles. The starting move is to stop treating reward as one number. One line of work shows that the feedback an agent receives decomposes into two orthogonal channels, an *evaluative* signal (how good was that action) and a *directive* signal (how should it change), and that a scalar reward carries only the first and silently discards the second (see "Can scalar rewards capture all the information in agent feedback?"). Once reward is viewed as modular in this way, the lost channel can be recovered: natural-language critiques break performance plateaus because they restore the "why it failed and how to fix it" information that numbers cannot encode (see "Can natural language feedback overcome numerical reward plateaus?").
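To make the evaluative/directive split concrete, here is a minimal sketch. The `Feedback` class and `to_scalar` function are hypothetical names invented for illustration and do not come from any of the cited notes. The sketch only shows that two pieces of feedback with different directive content become indistinguishable once projected to a scalar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for the two channels described above."""
    score: float    # evaluative channel: how good was the action
    critique: str   # directive channel: how the action should change


def to_scalar(feedback: Feedback) -> float:
    """Project feedback onto a scalar reward; the critique is dropped."""
    return feedback.score


# Two failures with the same score but different fixes.
loop_bug = Feedback(score=0.0, critique="off-by-one in the loop bound")
unit_bug = Feedback(score=0.0, critique="convert milliseconds to seconds")

# After projection the learner cannot tell the two failures apart.
same_scalar = to_scalar(loop_bug) == to_scalar(unit_bug)
same_feedback = loop_bug == unit_bug
```

Here `same_scalar` is true while `same_feedback` is false, which is the information loss the paragraph describes.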
The same modularity shows up as literally adding reward terms together. Binary correctness rewards quietly teach models to guess confidently, because a confident wrong answer costs no more than a hesitant one. Adding a Brier-score term as a *second* component mathematically guarantees that accuracy and calibration are optimized jointly, with no trade-off (see "Does binary reward training hurt model calibration?"). That is the modularity argument in miniature: a separable objective fixes a failure that the monolithic objective baked in. The reward function itself can also be a composed, swappable artifact rather than something hand-tuned: LLMs can generate reward-shaping functions by first solving a simplified, deterministic version of a problem and then converting that plan into shaping signals for the real stochastic task (see "Can LLMs design reward functions for reinforcement learning?").
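The two-term reward can be sketched as follows. The exact form used here (correctness minus the squared gap between stated confidence and the outcome, with unit weight) is an assumption for illustration, since the notes only say that a Brier-score term is added as a second component. All function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Monolithic reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness term plus a Brier-style penalty (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with prob p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))
```

Under the binary reward, a wrong answer scores 0.0 whatever confidence was claimed. Under the two-term reward, a wrong answer stated with confidence 0.95 scores lower than the same wrong answer stated with confidence 0.2. Because the confidence-dependent part is a squared-error penalty, the expected reward is maximized when the stated confidence equals the true probability of being correct, which `best_confidence` checks numerically on a grid.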
Where this connects to *generalization* specifically is in how reward models are framed. Instead of learning an absolute scale of "good," a reward model can be redefined as a policy *discriminator* that scores how close a policy sits to a chosen target. Because the target is a slot you fill in rather than a fixed preference baked into the weights, the same pre-trained reward model transfers across task formulations it never saw labels for (see "Can reward models learn by comparing policies instead of judging them?"). The reward is modular in the deepest sense: the objective is parameterized, not hardcoded.
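The "target as a slot" idea can be illustrated with a toy sketch. This is not the learned reward model described in the notes: the token-overlap `jaccard_similarity` below is a deliberately crude stand-in for a trained discriminator, and every name is hypothetical. It only shows how one fixed scoring function yields different rankings when the target changes.

```python
from typing import Callable

Similarity = Callable[[str, str], float]


def jaccard_similarity(a: str, b: str) -> float:
    """Placeholder similarity: overlap of lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def discriminator_reward(
    target_output: str,
    candidate_output: str,
    similarity: Similarity = jaccard_similarity,
) -> float:
    """Score a candidate by its closeness to an output of the target policy."""
    return similarity(target_output, candidate_output)


# Two target policies for the same question: terse and explanatory.
terse_target = "paris"
verbose_target = "the capital of france is paris"

candidate_short = "paris"
candidate_long = "the capital of france is paris a city on the seine"

# Same scoring function, different target, opposite preference.
prefers_short = (discriminator_reward(terse_target, candidate_short)
                 > discriminator_reward(terse_target, candidate_long))
prefers_long = (discriminator_reward(verbose_target, candidate_long)
                > discriminator_reward(verbose_target, candidate_short))
```

Nothing about "terse is better" or "detailed is better" lives in the scoring function itself; the preference comes entirely from the target that is plugged in.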
Policy-side modularity follows the same logic. Meta-agents trained with RL can assemble a *fresh* multi-agent architecture per query rather than reusing one fixed workflow, treating sub-agents as composable building blocks selected on the fly (see "Can AI systems design unique multi-agent workflows per individual query?"). Policies also generalize better when their *learning* is modular: processing successful trajectories as concrete demonstrations but failures as abstracted lessons (two different update rules for two different signals) beats treating every episode the same way (see "Should successful and failed episodes be processed differently?"). The thread running through all of these: a monolithic reward or policy memorizes one task well, while a decomposed one keeps the pieces you can carry somewhere new. The less obvious takeaway is that "goal generalization" here is not really about bigger models. It is about refusing to collapse rich feedback into a single number in the first place.
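The asymmetric treatment of successes and failures can be sketched as a small memory structure. This is an illustrative assumption rather than the method from the cited note: the `Episode` and `SkillMemory` classes are hypothetical, and the `failure_reason` field stands in for whatever abstraction step a real system would use to summarize a failed trajectory.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Episode:
    task: str
    actions: List[str]
    success: bool
    failure_reason: str = ""  # hypothetical: supplied by some abstraction step


@dataclass
class SkillMemory:
    demonstrations: List[Tuple[str, List[str]]] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)

    def consolidate(self, episode: Episode) -> None:
        """Apply a different update rule depending on the outcome."""
        if episode.success:
            # Successes are kept concrete: the full action sequence is stored.
            self.demonstrations.append((episode.task, list(episode.actions)))
        else:
            # Failures are kept abstract: only a short lesson is stored,
            # and repeated lessons are not duplicated.
            lesson = f"{episode.task}: avoid {episode.failure_reason}"
            if lesson not in self.lessons:
                self.lessons.append(lesson)


memory = SkillMemory()
memory.consolidate(Episode("open door", ["find key", "unlock", "push"], True))
memory.consolidate(Episode("open door", ["push", "push", "push"], False,
                           "pushing before unlocking"))
memory.consolidate(Episode("open door", ["push", "kick"], False,
                           "pushing before unlocking"))
```

After these three episodes the memory holds one full demonstration and one lesson, since the two failures collapse to the same abstracted lesson and their raw action sequences are discarded.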
Sources (7 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.
FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.