INQUIRING LINE

Can unified policies handle negative feedback and critique transformation simultaneously?

This explores whether a single learned policy can do two jobs at once — learn from what went wrong (negative feedback) and turn criticism into something actionable (critique transformation) — rather than splitting those into separate components.


The corpus suggests the question hides a deeper one: what kind of information feedback actually carries, and whether anything is lost by collapsing it. The most useful insight is that "negative feedback" and "critique" aren't the same thing. Agent feedback splits into two orthogonal channels: an *evaluative* signal (how bad was this?) and a *directive* one (here's how to fix it). A scalar reward captures the first and throws away the second (see "Can scalar rewards capture all the information in agent feedback?"). That distinction is exactly why critique transformation matters: it recovers the directional information a thumbs-down loses.
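To make the two-channel split concrete, here is a minimal Python sketch. The `Feedback` class, the `to_scalar_reward` helper, and the second critique string are hypothetical names and values chosen for illustration; they are not taken from any of the cited notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    """One piece of agent feedback, split into two channels (illustrative only)."""
    score: float               # evaluative channel: how bad was this?
    directive: Optional[str]   # directive channel: how to fix it (may be absent)


def to_scalar_reward(fb: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive text is dropped."""
    return fb.score


# Two different critiques with the same severity (second string is made up).
a = Feedback(score=-1.0, directive="prefer more romantic")
b = Feedback(score=-1.0, directive="prefer somewhere quieter")

# Once reduced to a scalar, the two critiques become indistinguishable.
same_reward = to_scalar_reward(a) == to_scalar_reward(b)
same_feedback = a == b
```

Here `same_reward` is true while `same_feedback` is false: the scalar keeps the severity and loses the direction, which is the information critique transformation is meant to recover.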

There's strong evidence the two can be combined. The cleanest example: few-shot-prompted language models convert a user's complaint ("doesn't look good for a date") directly into a positive preference like "prefer more romantic," so a retrieval system finds better matches without fine-tuning (see "Can language models bridge the gap between critique and preference?"). That's negative feedback and critique transformation happening in one pass. On the recommender side, the unified-policy case is even more direct: folding what to ask, what to recommend, and when to do each into a single policy beats optimizing them separately, because separation blocks gradient signals from informing each other (see "Can unified policy learning improve conversational recommender systems?"). The argument for unification is the same in both: keeping the jobs apart wastes information that should flow between them.
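The complaint-to-preference step can be sketched as a few-shot prompt builder. This is an illustration under assumptions, not the prompt used in the cited work: `build_critique_prompt` and the instruction wording are hypothetical, and only the first example pair comes from the text above. The resulting string would be sent to whichever LLM client you use.

```python
def build_critique_prompt(critique, examples):
    """Assemble a few-shot prompt asking an LLM to restate a complaint as a preference."""
    lines = ["Rewrite each critique as a positive preference.", ""]
    for complaint, preference in examples:
        lines.append(f'Critique: "{complaint}"')
        lines.append(f'Preference: "{preference}"')
        lines.append("")
    # Leave the final preference open for the model to complete.
    lines.append(f'Critique: "{critique}"')
    lines.append('Preference: "')
    return "\n".join(lines)


# The first pair is the example quoted in the text; the second is a made-up illustration.
FEW_SHOT = [
    ("doesn't look good for a date", "prefer more romantic"),
    ("too loud to talk", "prefer quieter"),
]

prompt = build_critique_prompt("way too expensive", FEW_SHOT)
```

The model's completion is a positive preference string that a retrieval system can use directly as a query, which is why no fine-tuning is involved.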

The reinforcement-learning side of the corpus shows why critique-as-transformation outperforms raw negative reward. Models stuck on a numerical-reward plateau start producing correct solutions once they are handed a chain-of-thought critique explaining *why* they failed; the number alone never carried that information (see "Can natural language feedback overcome numerical reward plateaus?"). A related method skips the external reward model entirely: given retrospective evidence of its own mistakes in-context, the policy acts as its own process reward model and converts rich feedback into dense gradient signals (see "Can environment feedback replace scalar rewards in policy learning?"). So a unified policy doesn't just *tolerate* both signals: the directive critique is what makes the negative signal teachable.

Here's the surprise the corpus offers: negative feedback alone is more powerful than people assume. Training on only negative samples, which suppresses wrong trajectories, improves Pass@k and often matches full RL (PPO and GRPO), because it preserves solution diversity where positive-only reinforcement collapses it by piling probability onto a few winners (see "Does negative reinforcement alone outperform full reinforcement learning?"). Critique models reinforce this from another angle: injecting step-level critique during training keeps exploration diverse and prevents premature convergence (see "Do critique models improve diversity during training itself?"). And there's a hint that the two signal types may want *asymmetric* handling, not identical treatment: successes stored as concrete demonstrations, failures abstracted into lessons (see "Should successful and failed episodes be processed differently?"). That's the one caution against naive unification: a single policy can carry both, but it may need to process them differently inside.
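The diversity argument can be illustrated with a toy softmax policy over four answers. This is a hand-built sketch, not the cited experiment: the answer set, the learning rate, and the choice of which answers get sampled are all assumptions made for illustration. The contrast comes from which samples are seen, here a positive-only learner that keeps finding the same correct answer.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_step(logits, sampled, sign, lr=0.5):
    """One gradient step on log pi(sampled): sign=+1 reinforces, sign=-1 suppresses."""
    probs = softmax(logits)
    # d log pi(sampled) / d logit_j = 1[j == sampled] - pi_j
    return [z + sign * lr * ((1.0 if j == sampled else 0.0) - p)
            for j, (z, p) in enumerate(zip(logits, probs))]


# Toy setup (hypothetical): answers 0 and 1 are correct, 2 and 3 are wrong.
start = [0.0, 0.0, 0.0, 0.0]

# Positive-only: the sampler keeps finding correct answer 0 and reinforces it.
pos = start
for _ in range(20):
    pos = logprob_step(pos, sampled=0, sign=+1)

# Negative-only: sampled wrong answers 2 and 3 are suppressed in turn.
neg = start
for step in range(20):
    neg = logprob_step(neg, sampled=2 + step % 2, sign=-1)

pos_probs = softmax(pos)
neg_probs = softmax(neg)
```

In this setup the positive-only policy concentrates mass on answer 0 and pushes the other correct answer below its starting probability, while the negative-only policy moves mass off the wrong answers and leaves the two correct answers equally likely.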

The honest limit: the corpus has no paper benchmarking a *single* policy explicitly doing negative-feedback learning and critique transformation side by side. What it gives you instead is the architecture of the answer: feedback decomposes into evaluation plus direction, unification beats separation when it lets those channels cross-inform, and pure self-improvement without any external critique signal eventually stalls on its own circularity (see "Can models reliably improve themselves without external feedback?"). The pieces are all here; no single note assembles them for you.


Sources (9 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can language models bridge the gap between critique and preference?

Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.

Can unified policy learning improve conversational recommender systems?

Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL/LLM researcher tasked with re-evaluating whether unified policies can handle negative feedback and critique transformation simultaneously—a question a curated library explored across 2021–2026. Treat these findings as dated; your job is to test whether newer models, methods, or tooling have shifted the constraints.

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026:
• Agent feedback decomposes into two orthogonal channels: evaluative (how bad?) and directive (how to fix?); scalar rewards discard the directive information (2021–2024).
• In conversational recommendation, a unified policy outperforms separately optimized components because gradient signals cross-inform; language models convert complaints into preferences in one pass (2021–2024, ~2109.07576).
• Natural-language critique breaks RL performance plateaus where numerical reward stalls; retrospective mistake evidence in-context enables dense credit assignment (2024–2025, ~2411.16579).
• Negative reinforcement alone (suppressing wrong trajectories) matches or exceeds full RL by preserving solution diversity; critique models prevent premature convergence during training (2025–2026, ~2506.01347).
• Self-improvement without external critique signal eventually circularizes and stalls; asymmetric processing (concrete demos for successes, abstract lessons for failures) may outperform identical handling (2024–2026, ~2412.02674, ~2601.20802).

Anchor papers (verify; mind their dates):
• arXiv:2109.07576 (2021-09): Critique-to-preference transformation in conversational recommendation.
• arXiv:2411.16579 (2024-11): Critique models with test- and training-time supervision for LLM reasoning.
• arXiv:2506.01347 (2025-06): Negative reinforcement effectiveness in LLM reasoning.
• arXiv:2601.20802 (2026-01): Self-distillation in RL—asymmetric trajectory processing.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 3.5+), methods (GRPO, test-time RL, verifiable rewards), tooling (long-context in-context learning, multi-turn scaffolding), or evaluation (reasoning benchmarks, long-horizon tasks) have since RELAXED or OVERTURNED it. Separate the durable question—can critique and negation unify?—from the perishable limitation. Cite what resolved or still constrains.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from Jan–Jun 2026 if any. If the field has moved toward asymmetric processing or away from unified policies, name it.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Does test-time critique refinement (2025–2026) make training-time unification obsolete? (b) Can verifiable meta-reasoning rewards (2025) bypass the need to transform critique into a learnable signal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
