Can natural language feedback overcome numerical reward plateaus?
Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
Three failure modes of purely numerical RL for reasoning: (1) performance plateaus despite 8x scaling of training examples (from 4k to 32k); (2) self-reflection behaviors during RL, often celebrated as "aha moments," contribute minimally to successful problem-solving; (3) persistent failures on certain problems despite extensive trial-and-error training. The common cause: numerical feedback contains limited information about WHY a response is correct or incorrect and HOW to improve.
Critique-GRPO demonstrates that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems when provided with chain-of-thought critiques. The key is integrating both natural language feedback (NLF) and numerical feedback within online RL. The model learns from initial responses and critique-guided refinements simultaneously while maintaining exploration.
This is significant because it challenges the implicit assumption that RL's learning signal is sufficient for arbitrarily complex reasoning. Given the findings discussed in "Does reflection in reasoning models actually correct errors?", the ineffectiveness of self-reflection during RL training is predictable: the model cannot generate useful critiques of its own failures. External critiques break the ceiling because they provide the information that numerical rewards lack: specific identification of where reasoning went wrong.
The practical architecture has three components: (1) the model generates initial responses; (2) a reasoning-based reward model generates CoT critiques identifying flaws; (3) a shaping function enhances learning from valid refinements and heavily penalizes failed refinements. This approach encourages the model to integrate targeted refinements while preserving exploration.
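The three components above can be sketched in a few lines. This is a minimal illustration, not the shaping function from the Critique-GRPO paper: it assumes the standard GRPO group-normalized advantage over a mixed group of initial responses and critique-guided refinements, then applies a hypothetical asymmetric scaling to the refinements. The names `group_advantages` and `shape_refinement_advantages` and the factors `boost` and `penalty` are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward normalized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shape_refinement_advantages(advantages, is_refinement, boost=1.5, penalty=2.0):
    """Hypothetical shaping: amplify valid refinements, penalize failed ones harder.

    Initial responses keep their plain group-normalized advantage, so ordinary
    exploration is left untouched.
    """
    shaped = []
    for adv, refined in zip(advantages, is_refinement):
        if not refined:
            shaped.append(adv)            # initial response: unchanged
        elif adv > 0:
            shaped.append(adv * boost)    # valid refinement: stronger positive signal
        else:
            shaped.append(adv * penalty)  # failed refinement: heavier penalty
    return shaped


# Three failed initial responses, one successful refinement, one failed refinement.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]
is_refinement = [False, False, False, True, True]

advs = group_advantages(rewards)
shaped = shape_refinement_advantages(advs, is_refinement)
```

In this toy group the only positive signal comes from the critique-guided refinement, which is the situation described for persistently failed problems: without the refinement, every reward would be zero and the group-normalized advantages would carry no gradient signal at all.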
As discussed in "Do critique models improve diversity during training itself?", the NLF mechanism works by expanding the effective exploration space: critiques point toward regions of solution space that numerical rewards cannot identify.
Semantic reward shaping as lightweight NLF: The Semantic Reward Shaping paper proposes a complementary mechanism: using a small encoder-only transformer to compute cosine similarity between generated explanations and ground-truth references. This provides a dense, semantically rich reward signal within GRPO — not as information-rich as full CoT critiques, but vastly cheaper and faster than LLM-as-judge evaluation. The approach combines semantic similarity reward with auxiliary correctness and formatting rewards, significantly improving explanation faithfulness over SFT baselines. This occupies a middle ground between brittle keyword metrics (ROUGE) and expensive LLM-based critiques — suggesting the NLF principle scales down to lightweight implementations when full CoT critique is impractical.
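A rough sketch of how such a combined reward might be computed is given below. It assumes the explanation and reference embeddings have already been produced by some small encoder (not shown here), and the weights `w_sem`, `w_correct`, and `w_format` are illustrative placeholders rather than values from the paper. Clipping negative similarity to zero is also an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_reward(gen_emb, ref_emb, is_correct, is_well_formatted,
                    w_sem=0.6, w_correct=0.3, w_format=0.1):
    """Weighted sum of semantic, correctness, and formatting rewards.

    The weights are hypothetical. Negative similarity is clipped to zero so the
    semantic term never subtracts from the auxiliary rewards.
    """
    semantic = max(0.0, cosine_similarity(gen_emb, ref_emb))
    return (w_sem * semantic
            + w_correct * float(is_correct)
            + w_format * float(is_well_formatted))


# Toy 2-d "embeddings" standing in for real encoder outputs.
aligned = combined_reward([1.0, 0.0], [1.0, 0.0], True, True)
unrelated = combined_reward([1.0, 0.0], [0.0, 1.0], False, False)
```

The semantic term is dense (it varies continuously with the explanation), while the correctness and formatting terms are binary; that mix is what makes the signal richer than a pass/fail reward while remaining far cheaper than querying a judge model.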
Textual gradients as generalized NLF: TextGrad (2406.07496) formalizes the broader principle: natural language criticism can serve as "textual gradients" propagated through arbitrary computation graphs including LLM API calls, simulators, and external solvers. Each AI system component is a node in a computation graph; textual feedback describes how variables should change to improve the system. This extends NLF from RL plateau-breaking to general AI system optimization — the same principle (informative language feedback > scalar signal) applies at the system level, not just the training level.
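The idea of propagating feedback through a computation graph can be illustrated with a toy sketch. This does not use the TextGrad library or its API; `Variable`, `backward`, and `stub_critic` are hypothetical names, and the critic is a plain string template standing in for what would be an LLM call in a real system. The traversal assumes a simple chain or tree of variables, not a general graph with shared nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """A node in the computation graph: a text value plus the feedback it has received."""
    name: str
    value: str
    predecessors: list = field(default_factory=list)
    feedback: list = field(default_factory=list)


def stub_critic(parent, child, child_feedback):
    """Stand-in for an LLM call that translates feedback on `child` into advice for `parent`."""
    return f"To address '{child_feedback}' in {child.name}, revise {parent.name}."


def backward(output, loss_feedback, critic):
    """Push textual feedback from the output back to every upstream variable."""
    output.feedback.append(loss_feedback)
    stack = [output]
    while stack:
        node = stack.pop()
        combined = " ".join(node.feedback)
        for parent in node.predecessors:
            parent.feedback.append(critic(parent, node, combined))
            stack.append(parent)


# A two-node chain: a prompt that produced an answer.
prompt = Variable("prompt", "Solve the problem step by step.")
answer = Variable("answer", "x = 4", predecessors=[prompt])

backward(answer, "The answer skips the verification step.", stub_critic)
```

After the call, `prompt.feedback` holds a natural-language description of how the prompt should change, which plays the role that a numerical gradient plays in ordinary backpropagation.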
Inquiring lines that use this note as a source 151
This note is a source for these synthesized inquiries.
- What cognitive capabilities do agents need to internalize social feedback?
- Can unified policies handle negative feedback and critique transformation simultaneously?
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- Why does combining natural language with numerical scores improve prediction accuracy?
- Why does scaling reasoning tokens fail to improve unfamiliar tasks?
- Can checklist-based rewards fix judgment problems in RL training?
- How does turn-level working alliance inference enable real-time therapist feedback?
- Does therapy environment difficulty calibration affect RL policy learning quality?
- Can hierarchical reinforcement learning manage structured therapy conversation phases?
- How does execution-guided critique differ from abstract action evaluation?
- What role does natural language play in breaking reinforcement learning performance plateaus?
- Why does retrieval chain training unlock scaling laws in QA?
- How does entropy collapse in reinforcement learning differ from entropy maintenance in graph reasoning?
- Does in-distribution reward model performance hide failures from context shift?
- How does partial information exposure create feedback loops that deepen knowledge gaps?
- How do intrinsic motivation principles explain why generating novel challenges improves learning?
- What makes bilevel metacognition architectural rather than emergent in current systems?
- How do reward model ensembles improve robustness to miscalibration?
- How does prompt context decomposition reveal hidden reward model failures?
- Why do reward models learn surface-level shortcuts instead of genuine quality assessment?
- Does the timing of AI feedback relative to user reasoning change its effectiveness?
- What makes trajectory more actionable than absolute scores for human moderators?
- How do critique models prevent policy entropy collapse during reasoning training?
- Why does natural language feedback break performance plateaus that numerical rewards alone cannot?
- What makes few-shot prompting sufficient for critique-to-preference transformation without fine-tuning?
- Can solution traces substitute for process-level reward signals in math reasoning?
- Can reward model training be automated without changing feedback mechanisms?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- Can offline reinforcement learning improve dialogue policy baseline performance?
- What information do next-state signals contain beyond what scalar rewards capture?
- Can scaling predictions become reliable if improvements are continuous not sudden?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- Can energy minimization replace reasoning-specific reinforcement learning for system 2 thinking?
- Do outcome-only reward signals miss step-level errors that compound later?
- Does policy entropy collapse represent the main bottleneck in reasoning-focused RL scaling?
- Can reinforcement learning add missing domain knowledge to fine-tuned reasoning models?
- Why does policy entropy collapse limit reasoning and dialogue RL scaling?
- How do graduated phase rewards emerge complex dialogue behavior from simple objectives?
- Can meta-reinforcement learning explain why this bias pattern emerges rationally?
- What makes process-level supervision better than outcome-only reward signals?
- Can subjective tasks be delegated without human feedback loops?
- How does process-focused feedback compare to outcome-focused feedback in skill training?
- How do contrasting examples improve AI feedback quality over generic suggestions?
- Can co-evolved critics truly circumvent static evaluator limitations in self-improvement?
- How does modularity in reward and policy design enable goal generalization?
- How do probability-based rewards compare to self-consistency as training signals for reasoning?
- How does reward model training permit spurious correlations in scoring?
- Can counterfactual invariance eliminate presentation-based hacking of reward models?
- Does reinforcement learning learn optimal per-turn reasoning discipline?
- Can critic model trios evaluate reasoning quality more reliably than outcome rewards alone?
- Is reward propagation in RL formally dual to cause inference in memory?
- Can intrinsic reward signals extend beyond mathematics to medicine and law?
- Can reward models trained for engagement fix the informativeness problem?
- Could reward signals incentivize active intent discovery over passive response generation?
- How does implicit feedback structure differ from explicit ratings mathematically?
- How do semantic reward shaping approaches compare to full critique models?
- Can textual gradients generalize natural language feedback across computation graphs?
- What information do numerical rewards fail to provide for reasoning tasks?
- How does negative reinforcement redistribute probability without guiding toward correct answers?
- Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?
- Do depth thresholds correspond to transitions between procedural and strategic learning?
- Can architectural changes like decoupling intent understanding help overcome next-turn reward limitations?
- Why do generative reward models produce more interpretable evaluations than scalar scores?
- Can UCB-style bonuses over outcome space prevent policy entropy collapse?
- How do Q-value models improve action selection compared to value models?
- Can negative feedback through critiques achieve the same steering flexibility as positive preferences?
- How do reward model biases cascade into downstream optimization failures?
- Can programmatic meta-reasoning rewards operationalize agentic process supervision?
- What information-theoretic framework explains why process rewards beat outcome only?
- Can preference optimization reduce overthinking without sacrificing accuracy?
- How can we measure whether process rewards actually align with reasoning quality?
- Do current math benchmarks measure outcomes or rhetorical plausibility?
- Does attention bias in transformers compound with training-level reward insensitivity?
- Can we distinguish between genuine alignment and response quality bias in reward signals?
- How do task-type perceptions like chat versus reasoning guide different reward strategies?
- How does the pretrained prior set a capability ceiling for reward model exploration?
- How does reinforcement learning differ from chain-of-thought distillation?
- Can negative reinforcement alone match full RL performance on domain tasks?
- Why does policy entropy collapse predict sigmoid saturation points?
- How do reward models benefit from extended thinking during evaluation scoring?
- Can structured natural language feedback outperform scalar rewards in RL?
- What causes length bias in language model reward models?
- Can AI evaluation match human judgment quality in structured domain tasks?
- Does reinforcement learning preserve reasoning quality better than supervised fine-tuning?
- Can runtime interventions like meta-cognitive prompting work where training interventions fail?
- Do reward reasoning models with chain-of-thought reasoning evaluate prompts better?
- Can decomposing rewards into prompt-free and prompt-related components fix this blindspot?
- What makes pretraining composition more important than reward engineering?
- Can reward design fix the conflict between reasoning accuracy and abstention calibration?
- Can behavior-level emotion rewards maintain factual reliability in emotional contexts?
- What distinguishes intrinsic metacognition from extrinsic human-designed loops?
- Can emotion-grounded rewards replace coarse bonus signals in hierarchical dialogue RL?
- How does reinforcement learning on outcomes reinforce template-matching rather than computation?
- Can environmental rewards directly refine natural language descriptions of actions?
- Why does imitation learning alone plateau without outcome-based refinement?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?
- What reward mechanisms make thinking-based compression budget-controllable and reliable?
- Can evaluation trajectories and interaction histories replace single-answer scoring?
- Can process supervision improve agentic RL through meta-reasoning rewards?
- How does belief-shift reward compare to curiosity-driven and process reward approaches?
- Why does belief-shift reward enable smaller models to match larger baselines?
- Can binary judge feedback replace external reward signals for skill learning?
- How do checklists prevent reward models from exploiting superficial response artifacts?
- Why do standard process reward models struggle with branching reasoning traces?
- How do high-entropy tokens concentrate reinforcement learning's effect?
- How does memory folding enable agents to reconsider strategies mid-task?
- Why does self-segmentation into chunks-of-thought matter for reward models?
- Do self-supervised process reward models scale better than human annotation?
- Can in-context reinforcement learning match human sample efficiency on real problems?
- How can interactive evaluation avoid replicating fragmentation problems from response-centered benchmark culture?
- Why does scalarization of rewards fail for multi-objective GRPO training?
- How should multi-objective post-training balance competing behavioral goals?
- How does tree-search topology convert outcome rewards into intermediate supervision?
- Can the exploration ceiling be raised beyond what pretraining established?
- Does outcome-based reinforcement learning improve explanation faithfulness?
- What other downstream metrics could serve as RL reward sources?
- Can verifiable rewards during pretraining replace costly human preference labeling?
- Why do outcome-based rewards train language models to over-engage rather than abstain?
- How do relational reward signals compare to absolute preference encodings in RL?
- Do personalized reward models work better than one-size-fits-all approaches?
- Can language models function as implicit process reward models through retrospection?
- How does in-context feedback integration differ from learned reward signals?
- Why does policy entropy collapse when scaling RL for reasoning?
- Does policy entropy collapse in formal reasoning produce the same outcome in social reasoning?
- Can entropy regularization or critique models prevent search strategy collapse during RL training?
- What does process supervision reveal about step-level reasoning versus outcome rewards?
- Can early experience replace external rewards as a learning signal?
- Why do veto mechanisms on critical dimensions prevent collapse into exploitable reward modes?
- Why does information asymmetry between teacher and student enable effective feedback learning?
- Can held-out validation gates prevent optimizer hallucinations in skill proposals?
- How do reward models guide inference-time compute allocation decisions?
- What role does task structure play in rewarding delayed thinking?
- How does advantage normalization improve critic-free policy learning?
- Why does gradient discarding limit standard policy clipping?
- Can tree-GRPO work with extremely noisy or sparse outcome reward signals?
- What are the actual limits of sibling comparison versus trained process reward models?
- What causes policy entropy collapse in reasoning-focused reinforcement learning?
- What makes binary rewards more effective than richer reward signals?
- When does a task lack a meaningful multi-dimensional reward structure?
- Can RL directly optimize attention distributions instead of text generation?
- Can rich environment feedback replace human preference labels entirely?
- Does semantic diversity in output space compete with reward-component diversity?
- Does pairwise self-judgment avoid reward model scaling problems?
- How do internal model mechanisms escape token-level reinforcement signals?
- How do binary comparisons constrain reward scale in multi-user preference learning?
- Why does externalizing bookkeeping raise effective feedback compute?
- Can scaling data alone solve performance gaps on long-tail concepts?
- Can open-world evaluations become a scalable paradigm without becoming the next benchmark trap?
- Can trajectory structure replace hand-annotated process reward models entirely?
Related concepts in this collection 4
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  extends: NLF is the mechanism by which critique-driven exploration improves diversity
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  explains: why self-reflection fails to break plateaus; external critique is needed
- Does revising your own reasoning actually help or hurt?
  Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
  directly supports: external NLF breaks plateaus; internal reflection does not
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  connects: NLF may work by re-expanding entropy in the specific regions where the model has collapsed
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Teaching Large Language Models to Reason with Reinforcement Learning
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
- Reward Reasoning Model
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- RM-R1: Reward Modeling as Reasoning
- Efficient Reinforcement Learning via Large Language Model-based Search
Original note title
natural language feedback breaks rl performance plateaus that scaling numerical rewards alone cannot resolve