Can judges trained on both verifiable and non-verifiable tasks transfer across domains?
This note explores whether evaluator models (the LLM 'judges' and reward models that grade other models' outputs) can be trained on both verifiable tasks (math, code, anything with a checkable answer) and non-verifiable ones (writing, instruction-following, subjective quality) and then generalize to domains they weren't trained on. The corpus doesn't answer this with a single clean experiment, but several threads converge on a hopeful and specific picture: the bridge between verifiable and non-verifiable evaluation is the act of *reasoning before judging*, and that reasoning skill is what transfers.
The most direct evidence is the move to make judges *think*. When judges are trained with reinforcement learning to reason through an evaluation, by recasting judgment as a verifiable problem with synthetic right/wrong pairs, they stop leaning on exploitable surface cues and start reasoning about substance [Can reasoning during evaluation reduce judgment bias in LLM judges?]. This matters because the alternative is fragile: ordinary LLM judges fall for fake citations and pretty formatting in zero-shot attacks that require no model access at all [Can LLM judges be fooled by fake credentials and formatting?], [Can LLM judges be tricked without accessing their internals?]. A judge that only pattern-matches surface features won't transfer anywhere; a judge that reasons has something domain-independent to carry.
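To make "recasting judgment as a verifiable problem" concrete, here is a minimal sketch of what such a reward could look like. It assumes each synthetic pair is built so that the better response is known by construction, and that the judge ends its reasoning with a line like `Verdict: A`. The `SyntheticPair` type, the verdict format, and the `swapped` helper are all hypothetical illustrations, not the method of any note in the sources.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyntheticPair:
    """A judging problem whose correct answer is known by construction."""
    prompt: str
    response_a: str
    response_b: str
    better: str  # "A" or "B"


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last 'Verdict: A|B' found in the judge's reasoning, if any.

    The 'Verdict: X' format is an assumption made for this sketch.
    """
    matches = re.findall(r"Verdict:\s*([AB])\b", judge_output)
    return matches[-1] if matches else None


def judge_reward(pair: SyntheticPair, judge_output: str) -> float:
    """Verifiable reward: 1.0 if the judge picked the known-better response."""
    return 1.0 if parse_verdict(judge_output) == pair.better else 0.0


def swapped(pair: SyntheticPair) -> SyntheticPair:
    """Same problem with A and B exchanged, so position alone cannot earn reward."""
    flipped = "B" if pair.better == "A" else "A"
    return SyntheticPair(pair.prompt, pair.response_b, pair.response_a, flipped)
```

Because the label is fixed in advance, the reward can be checked mechanically even when the underlying task is subjective. Scoring the judge on both a pair and its swapped copy is one simple way to keep it from earning reward through position alone.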
The deeper trick the corpus surfaces is that the verifiable/non-verifiable divide is softer than it looks: you can often *manufacture* verifiability inside a soft domain. Checklist methods such as RLCF and RaR decompose a subjective instruction-following task into many small checkable sub-criteria, which both improves performance and stops the reward model from overfitting to superficial artifacts [Can breaking down instructions into checklists improve AI reward signals?]. Generative process reward models that reason step-by-step before scoring beat discriminative graders with far less labeled data; ThinkPRM, for example, uses only 1% of the PRM800K labels [Can generative reasoning beat discriminative models with less training data?]. And entirely verifier-free approaches reach into general domains: RARO uses an adversarial critic that discriminates expert from policy answers across arithmetic puzzles (Countdown), math (DeepMath), *and* poetry writing without any task-specific verifier [Can adversarial critics replace task-specific verifiers for reasoning?], [Can reasoning emerge from expert demonstrations alone?], while VeriFree replaces answer-checking with the likelihood of a reference answer given the generated reasoning and matches or surpasses verifier-based methods on broad benchmarks like MMLU-Pro and GPQA [Can reasoning improvement work without answer verification?].
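The checklist idea can be sketched in a few lines. This is an illustrative assumption, not the RLCF or RaR implementation: in those methods the sub-criteria may themselves be graded by a model, whereas here each item is a simple programmatic check so the example stays self-contained. The `checklist_reward` function and the example criteria are hypothetical.

```python
from typing import Callable, List, Tuple

# Each item pairs a human-readable criterion with a yes/no check on the response.
Check = Tuple[str, Callable[[str], bool]]


def checklist_reward(response: str, checklist: List[Check]) -> float:
    """Fraction of checklist items the response satisfies (0.0 to 1.0)."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    passed = sum(1 for _, check in checklist if check(response))
    return passed / len(checklist)


# Hypothetical decomposition of the instruction:
# "Write a short, calm reminder about the deadline and end with a question."
example_checklist: List[Check] = [
    ("at most 50 words", lambda r: len(r.split()) <= 50),
    ("mentions the deadline", lambda r: "deadline" in r.lower()),
    ("no exclamation marks", lambda r: "!" not in r),
    ("ends with a question", lambda r: r.strip().endswith("?")),
]
```

Calling `checklist_reward("The deadline is Friday. Can you confirm?", example_checklist)` scores the response against every item separately. A response that merely looks polished earns nothing for polish, which is the sense in which decomposition resists superficial artifacts.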
There's also a clean demonstration that the *evaluation apparatus itself* can transfer: MAJ-EVAL extracts stakeholder personas from domain documents and runs a structured debate that generalizes across summarization and dialogue without manual redesign [Can personas extracted from documents generalize across evaluation tasks?]. That's cross-domain transfer of a judging method, not of a single judge model, which is a useful reframing of what 'transfer' can even mean here.
The caution worth carrying away: transfer is real for *reasoning and method*, but not magic. Reinforcement learning with verifiable rewards (RLVR) sharpens sampling toward solutions the base model already had rather than expanding its true boundary [Does RLVR actually expand what models can reason about?], and imitation training shows you can fool evaluators by copying confident style while closing no actual capability gap [Can imitating ChatGPT fool evaluators into thinking models improved?]. So a judge can *look* like it transferred when it has only learned a domain's surface register. The thing you didn't know you wanted to know: the question of cross-domain judge transfer is really the question of whether your judge learned to reason or learned to recognize, and the same surface-cue biases that make naive judges hackable are exactly what keeps them from generalizing.
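The RLVR claim rests on pass@k comparisons between base and RL-tuned models. As a sketch of how such a comparison works, the block below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are made up purely for illustration and are not results from any of the sources.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n drawn samples were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever k > n - c, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100 on four problems.
N_SAMPLES = 100
base_counts = [5, 5, 5, 5]      # solves everything, but rarely
tuned_counts = [60, 60, 60, 0]  # solves three reliably, never the fourth

for k in (1, 50):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    tuned = mean_pass_at_k(tuned_counts, N_SAMPLES, k)
    print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

With these invented counts the tuned model wins at k = 1 and the base model wins at k = 50. That crossover is the shape of evidence behind the statement that sampling has been sharpened rather than the set of solvable problems expanded.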
Sources (11 notes)
Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
RARO recovers implicit reward functions from expert demonstrations through adversarial co-training between a reasoning policy and relativistic critic. This approach matches verifier-based RL performance on reasoning tasks while extending to domains lacking automated verification.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.