J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Paper · arXiv 2505.10320 · Published May 15, 2025

The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

Introduction. Better judgments can be made by learning how to reason, which is observed in both humans and machines. For models, the ability to judge predictions is a vital process that is applied at all stages of development: during training and inference to provide a reward or verification signal, and during final benchmark evaluation to judge performance. Classical evaluation using reward models typically outputs a score directly (Ouyang et al., 2022) without an explicit reasoning step. Using pre-trained and aligned language models as judges instead, i.e., LLM-as-a-Judge, made it possible to generate chain-of-thought reasoning before making a decision, which was at first elicited by prompting (Zheng et al., 2023; Gu et al., 2024; Saha et al., 2024). Subsequently, iterative finetuning and direct preference optimization (DPO) methods were developed to improve these reasoning steps (Mahan et al., 2024; Wang et al., 2024d; Saha et al., 2025). In this work, we investigate recipes for further improvements to judgment reasoning via online Reinforcement Learning (RL).

Discussion / Conclusion. We proposed J1, an RL recipe for training Thinking-LLM-as-a-Judge models. Our key innovation was in converting the judgment task into a verifiable task for all kinds of task prompts, themselves both verifiable and non-verifiable, and then optimizing the thoughts and judgments using an online RL method. We trained J1-Llama-8B and J1-Llama-70B, two generalist judge models that outperformed all baselines at their respective model sizes, o1-mini, and even a much larger R1 model on non-verifiable tasks. Using only pairwise supervision, we also trained Pointwise-J1 models that proved to be effective in mitigating position bias, thereby highlighting the potential of both Pairwise and Pointwise Thinking-LLM-as-a-Judge.
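To make the idea of a verifiable judgment task concrete, the following is a minimal sketch, not the paper's implementation. It assumes each training example is a pair of responses in which one is already known to be better, so the judge's final verdict can be checked automatically. The `<verdict>` tag format, the function names, and the both-orders consistency check are all hypothetical choices made for illustration; the text above says only that the rewards are verifiable and that they mitigate judgment bias.

```python
import re
from typing import Optional

# Hypothetical output format: the judge ends its reasoning with <verdict>A</verdict>
# or <verdict>B</verdict>. This tag is an assumption, not taken from the paper.
VERDICT_RE = re.compile(r"<verdict>\s*([AB])\s*</verdict>")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last 'A' or 'B' verdict in the output, or None if there is none."""
    matches = VERDICT_RE.findall(judge_output)
    return matches[-1] if matches else None


def verdict_reward(judge_output: str, better: str) -> float:
    """1.0 if the parsed verdict names the known-better response, else 0.0."""
    return 1.0 if parse_verdict(judge_output) == better else 0.0


def position_consistent_reward(out_ab: str, out_ba: str) -> float:
    """Reward only verdicts that are correct in both presentation orders.

    out_ab: judge output when the known-better response is shown as 'A'.
    out_ba: judge output when the same response is shown as 'B'.
    Requiring both to be correct is one illustrative way to penalize
    position bias; it is an assumption of this sketch.
    """
    correct_ab = verdict_reward(out_ab, "A")
    correct_ba = verdict_reward(out_ba, "B")
    return 1.0 if correct_ab == 1.0 and correct_ba == 1.0 else 0.0
```

Because the reward depends only on the final verdict and a known label, it can be computed for non-verifiable prompts as well, as long as a preference pair is available. The thinking trace itself is not scored directly in this sketch.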
