How do verifier-free and adversarial approaches compare in extending reasoning RL?
This explores two ways to train reasoning models with reinforcement learning when you don't have a reliable answer-checker: adversarial setups (a critic learns to tell good answers from bad) versus verifier-free setups (the reward comes from the model's own probabilities), and what each buys you.
Both are escape routes from the same bottleneck: reasoning RL normally needs a verifier, a rule or grader that says "this answer is correct", and that is expensive or impossible outside math and code. The corpus has two distinct ways around it. The adversarial route, RARO (Can adversarial critics replace task-specific verifiers for reasoning?), stages a game where a critic learns to discriminate expert answers from the policy's own attempts, so the reward signal is *learned* rather than supplied, and it holds up across domains as varied as Countdown, math, and poetry. The verifier-free route, VeriFree (Can reasoning improvement work without answer verification?), skips the judge entirely: it asks how likely the known reference answer becomes *given* the reasoning the model just generated, and uses that probability as both reward and weight. The goal is the same, extending reasoning RL into general domains, but one builds an opponent and the other reads its own confidence.
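The likelihood-based reward can be sketched in a few lines. This is a minimal illustration, not VeriFree's actual implementation: it assumes you already have the policy's per-token log-probabilities for the reference answer, conditioned on the question plus each sampled reasoning trace. The function names, the example numbers, and the mean baseline are all assumptions made for illustration.

```python
import math

def reference_probability(ref_token_logprobs):
    """Probability of the reference answer given question + sampled reasoning.

    `ref_token_logprobs` holds the policy's log-prob for each reference-answer
    token; summing them gives the log-prob of the whole answer.
    """
    return math.exp(sum(ref_token_logprobs))

def rewards_and_advantages(traces):
    """Score each sampled reasoning trace by how likely it makes the reference.

    `traces` has one list of reference-token log-probs per reasoning trace.
    The mean baseline below is an assumption, used only to centre the rewards.
    """
    rewards = [reference_probability(lp) for lp in traces]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    return rewards, advantages

# Hypothetical log-probs for a two-token reference answer under two traces.
traces = [
    [math.log(0.5), math.log(0.5)],  # weak reasoning: answer still uncertain
    [math.log(0.9), math.log(1.0)],  # strong reasoning: answer nearly forced
]
rewards, advantages = rewards_and_advantages(traces)
print(rewards, advantages)
```

No judge appears anywhere in the loop: a trace is rewarded only to the extent that it makes the known reference answer more probable, which is also why the approach needs a reference to condition on.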
The interesting tension is what they're optimizing for. An adversarial critic creates pressure that scales like a real verifier without you hand-writing one, which matters when "correct" is fuzzy (poetry has no unit test). The likelihood approach is cheaper and more direct but depends on having a reference answer to condition on. Both sit in a wider family of "reward without annotation" methods: information-theoretic dense rewards that score each reasoning step by its contribution to the final answer (Can we reward reasoning steps without human annotation?), generative reward models that reason *before* judging and beat discriminative graders on a fraction of the labels (Can generative reasoning beat discriminative models with less training data?), and execution-free code verification that reaches 93% accuracy on patch-equivalence checks without ever running the code (Can structured reasoning replace code execution for RL rewards?). Read together, these say the verifier was never the point. The *signal* was, and there are many ways to manufacture it.
Here's the unsettling part neither approach escapes. A separate line of work argues that RLVR, the verifier-based method these are trying to generalize, doesn't actually expand what a model can reason about: pass@k analysis shows base models match or beat RL-tuned ones at high k (many samples per problem), so RL is sharpening access to solutions already latent in the model, not creating new ones (Does RLVR actually expand what models can reason about?). The complementary framing is even sharper: RL post-training teaches a model *when* to deploy reasoning, not *how* to reason, and hybrid models recover 91% of the gains by routing tokens alone (Does RL post-training create reasoning or just deploy it?). If that's right, then the verifier-free vs. adversarial contest is a contest over *cheaper ways to surface existing capability across more domains*, which is a real and useful win, not over expanding the reasoning frontier itself.
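The pass@k argument is easier to follow with the estimator written out. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are hypothetical and chosen only to show how a model can win at k = 1 and still lose at high k; they are not results from the cited notes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n attempts containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts out of n = 256 samples on two problems.
n = 256
base_counts = [64, 2]   # base model: rarely right, but covers both problems
rl_counts = [230, 0]    # RL-tuned: very reliable on one, never solves the other

base_at_1 = mean_pass_at_k(base_counts, n, 1)
rl_at_1 = mean_pass_at_k(rl_counts, n, 1)
base_at_128 = mean_pass_at_k(base_counts, n, 128)
rl_at_128 = mean_pass_at_k(rl_counts, n, 128)
print(base_at_1, rl_at_1, base_at_128, rl_at_128)
```

With these made-up counts the RL-tuned model leads at k = 1, while the base model leads at k = 128 because it still occasionally solves the second problem. That crossover is the pattern the "sharpening, not expanding" reading relies on.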
So the honest comparison isn't "which extends reasoning further" but "which manufactures a trustworthy reward more cheaply, and where." Adversarial critics generalize to domains with no notion of a checkable answer; likelihood-based methods generalize wherever you have a reference to condition on, at lower cost. Both inherit the ceiling RLVR already hit. A parallel warning is worth knowing: binary correctness rewards quietly wreck calibration, training models to guess confidently, and a learned critic or a probability-based signal could either fix or amplify that depending on how it's shaped (Does binary reward training hurt model calibration?).
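The calibration warning can be made concrete. The source notes mention adding a Brier score term to the binary reward. Here is a minimal sketch, assuming the combined reward is correctness minus the squared gap between the model's stated confidence and correctness. The function name and this exact form are assumptions for illustration, not taken from the cited note.

```python
def calibrated_reward(correct, confidence):
    """Binary correctness reward minus a Brier penalty on stated confidence.

    `confidence` is the model's self-reported probability of being right.
    This additive form is an assumed illustration of a Brier-style term.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A plain binary reward gives 0.0 to both wrong answers below; the Brier
# term separates the confident guess from the appropriately unsure one.
confident_wrong = calibrated_reward(False, 0.9)
unsure_wrong = calibrated_reward(False, 0.1)
confident_right = calibrated_reward(True, 0.9)
print(confident_wrong, unsure_wrong, confident_right)
```

Under a binary reward, a confident wrong guess costs nothing extra, which is what pushes models toward confident guessing. With the penalty, overconfidence on a wrong answer is the worst outcome.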
Sources (8 notes)
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while avoiding the roughly 2x excess tokens that outcome-only methods produce.
GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.
Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without a trade-off.