Can environmental rewards directly refine natural language descriptions of actions?
This explores whether feedback from an environment (success/failure, reward signals) can be used to directly sharpen the language an agent uses to describe its actions — rather than just nudging a scalar score up or down.
The corpus suggests the answer is yes, but only because researchers have recognized that a scalar reward throws away most of what feedback actually contains. The cleanest statement of this is the finding that agent feedback splits into two orthogonal channels: an *evaluative* signal that says how well the action did and a *directive* signal that says how it should change (see "Can scalar rewards capture all the information in agent feedback?"). A single number captures only the first. The directive part, the part that could rewrite a natural-language description of an action, survives only if the feedback is kept in a richer form.
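To make the two-channel split concrete, here is a minimal sketch. The `Feedback` class, the `to_scalar` helper, and the example strings are hypothetical names invented for illustration; they are not taken from any of the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for one piece of environment feedback."""
    evaluative: float  # how well the action did, e.g. 0.0 = failure
    directive: str     # how the action should change, in plain language


def to_scalar(feedback: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive text is dropped."""
    return feedback.evaluative


# Two failures that call for different corrections (made-up examples).
a = Feedback(0.0, "Open the drawer before trying to pick up the key.")
b = Feedback(0.0, "The key is in the other room; go there first.")

# After collapsing, the two cases are indistinguishable to the learner.
same_reward = to_scalar(a) == to_scalar(b)
same_advice = a.directive == b.directive
```

Here `same_reward` is true while `same_advice` is false: the scalar channel cannot tell the two failures apart, which is the loss the paragraph above describes.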
Several notes show this richer feedback paying off. Critique-GRPO demonstrates that models stuck on a numerical-reward plateau start solving problems again once they receive chain-of-thought *critiques*, because the number never told them *why* they failed, only that they did (see "Can natural language feedback overcome numerical reward plateaus?"). Reflexion goes further and closes the loop with the environment directly: it takes an unambiguous success/failure signal and converts it into a written self-diagnosis stored in episodic memory, so the agent improves across episodes without ever updating its weights (see "Can agents learn from failure without updating their weights?"). That is environmental reward refining a natural-language description of behavior in the most literal sense: the binary signal triggers the update, but the *language* is what carries the learning forward.
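The Reflexion-style loop described above can be sketched as a toy program. This is an illustration under strong assumptions, not the Reflexion implementation: the environment, the fixed `ACTIONS` list, and the rule-based `act` and `reflect` functions are all hypothetical stand-ins for what would be a language model in practice.

```python
from typing import Optional

# Hypothetical action space for a toy "open the door" task.
ACTIONS = ["push door", "kick door", "unlock door with key"]


def environment(action: str) -> bool:
    """Return an unambiguous binary success/failure signal."""
    return action == "unlock door with key"


def reflect(action: str) -> str:
    """Turn a failure into a written self-diagnosis (an LLM would write this)."""
    return f"Trying '{action}' failed; choose a different action next time."


def act(memory: list[str]) -> str:
    """Pick the first action that no stored reflection warns against."""
    for action in ACTIONS:
        if not any(action in note for note in memory):
            return action
    return ACTIONS[-1]


def run_episodes(max_episodes: int = 5) -> tuple[Optional[int], list[str]]:
    """Retry across episodes; only the text memory changes, never any weights."""
    memory: list[str] = []
    for episode in range(1, max_episodes + 1):
        action = act(memory)
        if environment(action):
            return episode, memory
        memory.append(reflect(action))
    return None, memory


solved_at, memory = run_episodes()
```

In this toy run the agent succeeds on the third episode with two reflections in memory. The only thing that changed between attempts is the stored text, which is the sense in which the language carries the learning.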
The corpus also hints at why you'd want the reverse direction, language structuring the reward, which turns out to be the same insight seen from the other side. Checklist-based methods such as RLCF and RaR decompose a fuzzy instruction into verifiable sub-criteria, so the reward signal itself becomes a set of natural-language statements you can check (see "Can breaking down instructions into checklists improve AI reward signals?"). Emotion-based RL (RLVER) uses a simulated user's emotional trajectory as the reward, grounding the signal in something semantically meaningful rather than an abstract scalar (see "Can emotion rewards make language models genuinely empathic?"). And RLAG rewards not just the answer but the *rationality of the explanation*, explicitly optimizing the quality of the language describing the reasoning, not only the token-level outcome (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?").
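The checklist idea can be sketched as follows. This is a simplified illustration, not the RLCF or RaR procedure: the example instruction, the `Criterion` type, and the string-based checks are assumptions made here, and in practice such criteria may be scored by a model rather than by simple string tests.

```python
from typing import Callable, NamedTuple


class Criterion(NamedTuple):
    """One verifiable sub-criterion, stated in natural language."""
    description: str
    check: Callable[[str], bool]


# Hypothetical checklist for the instruction:
# "Reply in under 20 words, mention the refund, and end with a question."
CHECKLIST = [
    Criterion("uses fewer than 20 words", lambda r: len(r.split()) < 20),
    Criterion("mentions the refund", lambda r: "refund" in r.lower()),
    Criterion("ends with a question", lambda r: r.strip().endswith("?")),
]


def checklist_reward(response: str, checklist: list[Criterion]) -> tuple[float, list[str]]:
    """Return the fraction of criteria passed plus the failed descriptions."""
    failed = [c.description for c in checklist if not c.check(response)]
    score = (len(checklist) - len(failed)) / len(checklist)
    return score, failed


score, failed = checklist_reward("Your refund is on its way.", CHECKLIST)
```

The response passes two of three criteria, and `failed` names the one it missed ("ends with a question"). The reward is still a number, but it arrives with readable statements about what to change.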
The thing you didn't know you wanted to know: the notes collected here converge on the view that the scalar reward was always a lossy compression of feedback that is *natively linguistic*. Models can even internalize the evaluator, learning during training to write their own self-assessment in the unused sequence space after their output, at zero inference cost (see "Can models learn to evaluate their own work during training?"). So "can environmental rewards refine natural-language descriptions of actions" turns out to be less a niche technique and more a reframing of what reward signals are for: the number tells you *whether*, the language tells you *how*, and the recent work is about stopping the system from discarding the second half.
Sources (7 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.