Can verifier-free RL work without manual preference labels or task-specific training?

This explores whether reinforcement learning can improve a model's reasoning without three usual crutches: a hand-built verifier, human preference labels, and training tailored to one task — and the corpus shows several independent routes to exactly that.

The corpus is unusually direct about this: by late 2025 the field had converged on the idea that the signal you used to outsource can be manufactured from the model's own computations. The clearest map is the observation that verifier-free RL splits into three substitutable patterns: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces the explicit reward (see "Can language models replace reward models with internal signals?"). That framing is the doorway: each of the other papers here is essentially one instantiation of one of those three moves.

The most literal answer to your question is VeriFree, which drops answer-checking entirely and uses the conditional probability of a reference answer given the model's reasoning trace as both the reward and the training weight, and which matches verifier-based methods on hard general-domain benchmarks (see "Can reasoning improvement work without answer verification?"). But notice it still leans on a reference answer. If you want to remove even that, Test-Time RL goes further: it generates its reward by majority-voting across repeated samples on unlabeled data, betting that consensus answers tend to be correct and bootstrapping from there (see "Can models improve themselves using only majority voting?"). And RLSF removes external signal another way, ranking the model's own reasoning traces by answer-span confidence to synthesize preferences with no human labels at all (see "Can model confidence work as a reward signal for reasoning?").
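The three signals in the paragraph above can be sketched in a few lines. This is a minimal illustration, not any paper's implementation: the function names are hypothetical, the token log-probabilities and confidence values are assumed to come from scoring passes the caller has already run, and answers are assumed to be normalized strings.

```python
import math
from collections import Counter
from typing import Sequence


def reference_likelihood_reward(answer_token_logprobs: Sequence[float]) -> float:
    """VeriFree-style signal: p(reference answer | question, reasoning trace).

    Assumes the reference answer's tokens were already scored under the
    policy, conditioned on the sampled trace. Summing the log-probs gives
    the sequence log-likelihood; exp() turns it back into a probability.
    """
    return math.exp(sum(answer_token_logprobs))


def majority_vote_rewards(answers: Sequence[str]) -> list[float]:
    """Test-Time-RL-style signal: 1.0 for samples matching the consensus answer."""
    consensus, _count = Counter(answers).most_common(1)[0]
    return [1.0 if answer == consensus else 0.0 for answer in answers]


def confidence_preference_pairs(
    traces: Sequence[str], confidences: Sequence[float]
) -> list[tuple[str, str]]:
    """RLSF-style signal: (preferred, rejected) pairs ranked by confidence.

    Ties produce no pair, since neither trace is preferred.
    """
    ranked = sorted(zip(confidences, traces), key=lambda pair: pair[0], reverse=True)
    return [
        (ranked[i][1], ranked[j][1])
        for i in range(len(ranked))
        for j in range(i + 1, len(ranked))
        if ranked[i][0] > ranked[j][0]
    ]
```

Only the first function needs a reference answer. The other two consume nothing but the model's own samples, which is the distinction the paragraph draws.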

A different family manufactures the missing supervision through adversarial pressure rather than self-agreement. RARO sets up a game where a critic learns to discriminate expert answers from the policy's, which works across domains as varied as math and poetry without any domain-specific verifier (see "Can adversarial critics replace task-specific verifiers for reasoning?"). Ctx2Skill scales this into a three-role self-play loop: a Challenger that escalates difficulty, a Judge that issues binary verdicts, and policies that co-evolve by editing their own skills in natural language (see "Can language models learn skills without human supervision?"). The recurring lesson in this branch is that you have to balance the adversarial pressure against a generalization safeguard, or the whole loop collapses.

The interesting part is the catch that ties these together. Two papers warn that cheap, label-free reward shapes the model in ways you might not want. Binary correctness rewards push models toward confident guessing because nothing penalizes a confident wrong answer (see "Does binary reward training hurt model calibration?"), and RLHF more broadly drives models toward indifference to truth (see "Does RLHF make language models indifferent to truth?"): they still internally represent the right answer, they just stop committing to expressing it. So 'does it work without labels' has a quieter companion question: does the substitute signal preserve calibration? That's why several of these methods bolt on a second objective (a Brier-score term, confidence ranking, or information-theoretic per-step credit like L2T; see "Can we reward reasoning steps without human annotation?") rather than trusting a single self-generated number.
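To make the calibration point concrete, here is a small sketch of a correctness reward with a Brier-score penalty added. The exact combination (correctness minus squared error of the stated confidence) and the name `calibrated_reward` are assumptions for illustration; the notes here only say that a Brier-score term is added as a second reward.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier-score penalty.

    `confidence` is the model's stated probability that its answer is right.
    The penalty (confidence - outcome)^2 is zero only when the stated
    confidence matches the outcome exactly.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty
```

Under a purely binary reward, a wrong answer scores 0 whether the model was 90% or 10% confident. With the penalty, the confident wrong answer scores about -0.81 and the hesitant one about -0.01, so confident guessing is no longer free.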

One honest boundary worth carrying away: even when verifier-free RL works, it may not be doing what you think. The pass@k analysis (see "Does RLVR actually expand what models can reason about?") suggests this whole family of methods mostly sharpens sampling toward solutions the base model could already reach. It makes the model find its good answers more reliably, but doesn't expand the set of problems it can solve. So yes, verifier-free RL without labels or task-specific training genuinely works; just don't expect it to teach the model something it never knew.
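For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers is correct. The sketch below uses the unbiased estimator that is standard in the code-generation evaluation literature; whether the cited analysis computed it exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct answer.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem solved once in ten samples has pass@1 of 0.1 but pass@10 of 1.0. That is how a base model with low single-sample accuracy can still cover more problems at high k than a tuned model that answers a narrower set reliably.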

Sources (10 notes)

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while eliminating the cost of outcome-only methods that produce 2x excess tokens.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-RL researcher auditing whether verifier-free RL has genuinely decoupled from manual preference labels and task-specific training, or whether the apparent success masks hidden dependencies.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library of ~12 papers converged on three substitutable patterns for manufacturing supervision without hand labels: pairwise self-judgment (VeriFree, 2025-05), majority-vote consensus (Test-Time RL, 2025-04), and confidence-based preference ranking (RLSF, 2025-05). Adversarial self-play variants (RARO, Ctx2Skill, 2025-11/2026-04) scale across domains. Critical caveat: binary correctness rewards degrade calibration (2025-07); RL-optimized models exhibit "machine bullshit"—internal truth-retention but surface indifference to accuracy (2025-07, arXiv:2507.07484). Pass@k analysis shows these methods sharpen sampling toward solutions the base model could already find, not expand reasoning *capability* boundaries (2025-04, arXiv:2504.13837).

Anchor papers (verify; mind their dates):
- arXiv:2505.21493 (2025-05): Reinforcing General Reasoning without Verifiers
- arXiv:2504.16084 (2025-04): TTRL: Test-Time Reinforcement Learning
- arXiv:2507.07484 (2025-07): Machine Bullshit—emergent disregard for truth under RLHF
- arXiv:2604.27660 (2026-04): From Context to Skills (adversarial self-play scaling)

Your task:
(1) RE-TEST THE "NO LABELS, NO TASK-SPECIFIC TRAINING" CLAIM. For each substitutable pattern (self-judgment, consensus, confidence ranking, adversarial loops), determine whether newer models, larger-scale training, or tighter evaluation harnesses have since relaxed the hidden dependency on reference answers, domain-specific scaffolding, or implicit task structure. Separate the durable claim (cheap supervision *exists*) from the perishable one (it generalizes freely). Cite what changed it.
(2) Surface the strongest work from the last ~6 months contradicting the "capability-neutral" finding—does any recent method actually *expand* reasoning boundaries verifier-free, or do all variants still sharpen sampling?
(3) Propose 2 research questions that assume the regime *has* shifted: (a) Can calibration-aware objectives (Brier score, info-theoretic credit) and capability expansion both be decoupled from labeled preference data? (b) What is the minimal structural prior (e.g., problem decomposition, explicit intermediate steps) required to scale verifier-free RL to genuinely novel problem classes?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
