INQUIRING LINE

Can binary judge feedback replace external reward signals for skill learning?

This explores whether a simple thumbs-up/thumbs-down verdict from a judge model can stand in for the externally engineered reward signals that reinforcement learning usually depends on when teaching a model new skills.


The corpus says yes: binary judging can manufacture the missing feedback loop, but a recurring finding is that binary alone leaves real learning signal on the table. The clearest 'yes' case is self-play, where a Challenger raises difficulty as a curriculum and a neutral Judge issues binary verdicts that serve as the reward, letting language skills co-evolve with no human supervision at all (see "Can language models learn skills without human supervision?"). So a binary judge genuinely can replace an external reward source; the question is what it costs you.

The cost shows up first in calibration. Training on pure binary correctness rewards quietly teaches the model to guess confidently, because a confident wrong answer is penalized no differently than a hesitant one; adding a proper scoring term (the Brier score) as a second reward restores accuracy *and* calibration together (see "Does binary reward training hurt model calibration?"). That hints at a deeper structural limit: a scalar verdict captures *how well* an action did but discards *how it should change*. One line of work shows feedback actually decomposes into evaluative and directive channels, and binary scores throw the directive half away (see agent-next-state-signals-decompose-into-evaluative-and-directive-informa). The same gap is why models stuck on a reward plateau suddenly improve when handed a chain-of-thought critique instead of a number: the number never said *why* it failed (see "Can natural language feedback overcome numerical reward plateaus?").
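
The calibration point can be made concrete with a small sketch. It assumes the model reports a confidence in [0, 1] alongside each answer, and it assumes the two reward terms are combined as correctness minus the squared error of that confidence (the Brier penalty). The function names and this exact combination are illustrative assumptions; the cited note does not spell out the formula.

```python
def binary_reward(correct: bool) -> float:
    """Pure binary correctness reward: stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumed combination for illustration, not taken from the source.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected augmented reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))
```

Under this sketch the binary reward gives a wrong answer 0 whether it was stated at confidence 0.9 or 0.1, while the augmented reward penalizes the confident miss far more heavily. The expected augmented reward is maximized when the stated confidence equals the true probability of being correct, which is the property that makes the Brier score a proper scoring rule.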

Here's the turn you might not expect: even when feedback stays binary or coarse, you can extract far more from it than a single gradient step. Negative verdicts alone, which do nothing but suppress wrong trajectories, can often match full RL (PPO and GRPO) while preserving the answer diversity that positive-only training collapses (see "Does negative reinforcement alone outperform full reinforcement learning?"). Processing successes and failures *asymmetrically* (wins as concrete demonstrations, losses as abstracted lessons) pushes performance further still (see "Should successful and failed episodes be processed differently?"). So the binary signal isn't the bottleneck so much as how you metabolize it.
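
A toy sketch of how the same binary verdicts can be consumed differently. It assumes a categorical policy over a handful of candidate answers and a plain REINFORCE-style update on the logits; `verdict_weights` and `policy_gradient_step` are hypothetical names, and this is not the training setup of the cited work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verdict_weights(verdicts, mode):
    """Map binary verdicts (1 = pass, 0 = fail) to per-sample update weights."""
    if mode == "negative_only":
        return [-1.0 if v == 0 else 0.0 for v in verdicts]  # passes ignored
    if mode == "positive_only":
        return [1.0 if v == 1 else 0.0 for v in verdicts]   # fails ignored
    if mode == "full":
        return [1.0 if v == 1 else -1.0 for v in verdicts]
    raise ValueError(f"unknown mode: {mode}")


def policy_gradient_step(logits, sampled, weights, lr=0.5):
    """One ascent step on sum(weight * log pi(answer)) with respect to the logits."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for answer, weight in zip(sampled, weights):
        for j in range(len(logits)):
            indicator = 1.0 if j == answer else 0.0
            grad[j] += weight * (indicator - probs[j])  # d log pi(answer) / d logit_j
    return [z + lr * g / len(sampled) for z, g in zip(logits, grad)]
```

Starting from a uniform policy over four candidates, a negative-only step on one failed sample lowers that candidate and leaves the other three tied, whereas a positive-only step raises the single passing sample and lowers everything else. This illustrates the mechanism only; it does not reproduce the reported Pass@k results.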

And if the question is really 'can we drop the external reward source entirely,' the corpus has a whole family of answers that route around the judge altogether. Models can learn to evaluate their own work in the unused space after their output, computing their own reward at zero inference cost (see "Can models learn to evaluate their own work during training?"). Agents can treat the consequences of their own actions as supervision and match expert-trained baselines on half the data (see "Can agents learn from their own actions without external rewards?"). An agent's own shifting belief toward a solution becomes a dense intrinsic reward with no critic network at all (see "Can an agent's own beliefs guide credit assignment without critics?"). Tree search can generate process-level quality signals equivalent to human labels (see "Can tree search replace human feedback in LLM training?"), and rich environment feedback can be self-distilled into dense credit so the policy becomes its own process-reward model (see "Can environment feedback replace scalar rewards in policy learning?").
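
The belief-based reward mentioned above can be sketched as a log-ratio of successive probability estimates. The sketch assumes the agent can report, after every turn, its probability that a given candidate solution is correct; how those estimates are obtained is not covered here, and `belief_delta_rewards` is a hypothetical helper name.

```python
import math


def belief_delta_rewards(beliefs):
    """Per-turn reward log(p_t / p_{t-1}) for the belief in the correct solution.

    beliefs[0] is the prior before any turn; beliefs[t] is the estimate
    after turn t. Both the input format and the name are assumptions.
    """
    if len(beliefs) < 2:
        return []
    if any(not 0.0 < p <= 1.0 for p in beliefs):
        raise ValueError("beliefs must lie in (0, 1]")
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]
```

A turn that raises the belief earns a positive reward and a turn that lowers it earns a negative one, so every turn receives credit without a separate critic. The per-turn rewards telescope: their sum equals the log-ratio of the final belief to the prior.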

The synthesis worth carrying away: a binary judge *can* replace external rewards (self-play proves the loop closes), but the frontier has largely moved past the binary-vs-external framing. The richer question the corpus is actually answering is how to recover the *directive*, why-did-it-fail information that any scalar verdict discards, whether by adding a calibration term, breaking a holistic judgment into verifiable checklist sub-criteria (see "Can breaking down instructions into checklists improve AI reward signals?"), or letting the model become its own evaluator.
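
As a minimal illustration of the checklist idea, the sketch below scores a response as the fraction of sub-criteria it passes and also reports which ones failed. It assumes each criterion is a deterministic predicate with equal weight; the names are hypothetical, and the cited methods may weight or judge criteria differently.

```python
from typing import Callable, Dict, List

# A check is any predicate over the response text (an assumption of this sketch).
Check = Callable[[str], bool]


def checklist_reward(response: str, checks: Dict[str, Check]) -> float:
    """Fraction of checklist items the response satisfies, from 0.0 to 1.0."""
    if not checks:
        raise ValueError("checklist must not be empty")
    passed = sum(1 for check in checks.values() if check(response))
    return passed / len(checks)


def failed_items(response: str, checks: Dict[str, Check]) -> List[str]:
    """Names of the items that failed: the 'why' a single verdict would hide."""
    return [name for name, check in checks.items() if not check(response)]
```

Compared with one holistic pass/fail, the per-item breakdown keeps some directive information: the list of failed items says what to change, not just that something was wrong.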


Sources (12 notes)

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
