Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Paper · arXiv 2501.11651 · Published January 20, 2025
Test-Time Compute · Inference-Time Scaling

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present T1, which scales RL by encouraging exploration, and we use it to study inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We further employ an entropy bonus as an auxiliary loss, alongside a dynamic anchor for regularization, to facilitate reward optimization. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. For example, T1 with Qwen2.5-32B as the base model outperforms the recent Qwen QwQ-32B-Preview model on MATH500, AIME2024, and Omni-math-500.
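To make the entropy bonus and anchor regularization concrete, the following is a minimal sketch of a per-token loss that combines a policy-gradient term, an entropy bonus, and a KL penalty toward an anchor policy. It is not the paper's implementation: the function names, the coefficients `entropy_coef` and `kl_coef`, and the exact form of each term are assumptions for illustration, and how or when the anchor is refreshed is not described in this text.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_loss(policy_logits, anchor_logits, action, advantage,
               entropy_coef=0.01, kl_coef=0.1):
    """Hypothetical per-token RL loss (all coefficients are assumed values).

    - policy-gradient term: raises the log-probability of tokens with positive advantage
    - entropy bonus: subtracted from the loss, so higher entropy (more exploration) is rewarded
    - KL penalty: keeps the policy close to the anchor (reference) policy
    """
    policy = softmax(policy_logits)
    anchor = softmax(anchor_logits)
    pg_loss = -advantage * math.log(policy[action])
    return (pg_loss
            - entropy_coef * entropy(policy)
            + kl_coef * kl_divergence(policy, anchor))

# Usage: one token position with a three-token vocabulary.
loss = token_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], action=0, advantage=1.0)
```

In this sketch the anchor is simply a second set of logits. A "dynamic" anchor would presumably replace those logits with a more recent snapshot of the policy during training, but that schedule is an assumption here.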

Introduction. Large language models (LLMs) have recently exhibited remarkable capabilities in addressing complex reasoning tasks (Achiam et al., 2023; Team et al., 2023; Dubey et al., 2024), including mathematics (Shao et al., 2024), programming (Lozhkov et al., 2024; Zhu et al., 2024), and autonomous agents (Zhou et al., 2024). The chain-of-thought (CoT) paradigm (Wei et al., 2022) has been instrumental in enhancing LLM reasoning, emphasizing the importance of constructing and refining reasoning paths (Zelikman et al., 2022; Gulcehre et al., 2023), which represent the intermediate steps critical for problem-solving. Most recent approaches prioritize the imitation learning stage, with significant effort dedicated to generating reasoning paths through prompting (Yu et al., 2024; Mitra et al., 2024; Yue et al., 2024) or rejection sampling (Yuan et al., 2023), followed by training the model to replicate the selected reasoning processes.
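The rejection-sampling step mentioned above can be sketched as follows: sample several reasoning paths per problem and keep only those whose final answer matches a known reference, so the kept paths can serve as imitation-learning targets. This is a generic illustration, not the procedure of any cited work. `generate` and `extract_answer` are hypothetical callables, and the toy generator below stands in for an LLM.

```python
from itertools import cycle

def rejection_sample(problem, reference_answer, generate, extract_answer,
                     num_samples=8):
    """Keep only sampled reasoning paths whose final answer matches the reference."""
    accepted = []
    for _ in range(num_samples):
        path = generate(problem)
        if extract_answer(path) == reference_answer:
            accepted.append(path)
    return accepted

# Toy stand-ins (hypothetical): a real setup would sample from an LLM instead.
_canned_paths = cycle([
    "2 + 3 = 5. Answer: 5",
    "2 + 3 = 6. Answer: 6",
    "3 + 2 = 5. Answer: 5",
])

def toy_generate(problem):
    """Return the next canned reasoning path, ignoring the problem text."""
    return next(_canned_paths)

def toy_extract_answer(path):
    """Take whatever follows the last 'Answer:' marker as the final answer."""
    return path.rsplit("Answer:", 1)[-1].strip()

kept = rejection_sample("What is 2 + 3?", "5",
                        toy_generate, toy_extract_answer, num_samples=6)
```

Filtering on the final answer alone accepts any path that reaches the right result, including paths with flawed intermediate steps. This is one reason the selected data may teach imitation rather than exploration.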

Discussion / Conclusion. In this paper, we present T1 for enhancing large language models’ reasoning capabilities through scaled reinforcement learning. By promoting extensive exploration during RL training while maintaining stability through strategic penalties and oversampling, T1 achieves strong reasoning performance and demonstrates promising test-time scaling behavior. We introduce a novel approach to measuring inference scaling by analyzing the relationship between reasoning steps and model performance, revealing that increased RL training improves both reasoning accuracy and inference scaling trends. Experimental results show that T1 outperforms existing models on challenging math reasoning benchmarks; for example, T1 with Qwen2.5-32B as the base model outperforms Qwen QwQ-32B-Preview on MATH500, AIME2024, and Omni-math-500.