Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Paper · arXiv 2502.02508 · Published February 4, 2025
Test-Time ComputeSelf-Refinement and Self-ConsistencyInference-Time Scaling

Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs’ reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chainof-Action-Thought (COAT) reasoning and a twostage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data.

Introduction. Large language models (LLMs) have demonstrated remarkable performance across a wide range of reasoning tasks, including mathematical problems (Cobbe et al., 2021; Hendrycks et al., 2021a), programming (Chen et al., 2021; Zhuo et al., 2024) and logical reasoning (Han et al., 2024; Liu et al., 2020). One of the key techniques enabling these strong reasoning capabilities is Chain-of-Thought (CoT) prompting (Wei et al., 2022), which allows LLMs to address complex tasks by generating a series of intermediate reasoning steps. As a result, many early efforts focus on finetuning LLMs using large-scale, high-quality CoT reasoning chains, either through human annotation (Hendrycks et al., 2021a; Yue et al., 2024) or by distilling synthetic data from more advanced models (Yu et al., 2024; Toshniwal et al., 2024a; Ding et al., 2024). However, human annotation is extremely labor intensive, and distillation often limits the model’s reasoning capabilities to certain level.

Discussion / Conclusion. The training framework of Satori exhibits significant potential for enhancing LLM reasoning capabilities. The smallscale format tuning stage serves as a warm-up phase, allowing the LLM policy to internalize a specific reasoning format, while large-scale reinforcement learning (RL) plays a crucial role in incentivizing intrinsic reasoning abilities. We believe that this framework can inspire the research community to explore more methods for achieving autoregressive search, such as developing reasoning formats with a broader range of meta-actions, designing more advanced RL algorithms, and extending this approach to general domain.