To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra “thinking” really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT yields almost the same accuracy as CoT unless the question or the model’s response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT’s gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
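The equals-sign split described for MMLU can be illustrated with a minimal sketch. The `Record` fields and the function names below are hypothetical and chosen only for illustration; they are not taken from the paper's code, and the sketch assumes each example has already been graded under both CoT and direct answering.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated example (hypothetical schema, for illustration only)."""
    question: str
    cot_response: str
    cot_correct: bool
    direct_correct: bool


def has_equals(rec: Record) -> bool:
    # An "=" in the question or the CoT response is used as a cheap
    # proxy for symbolic operations.
    return "=" in rec.question or "=" in rec.cot_response


def accuracy(flags) -> float:
    flags = list(flags)
    # NaN signals an empty subset rather than a misleading 0.0.
    return sum(flags) / len(flags) if flags else float("nan")


def split_by_equals(records):
    """Compare CoT and direct accuracy on the two subsets."""
    summary = {}
    for name, keep in (("with_equals", True), ("without_equals", False)):
        subset = [r for r in records if has_equals(r) == keep]
        summary[name] = {
            "n": len(subset),
            "cot_acc": accuracy(r.cot_correct for r in subset),
            "direct_acc": accuracy(r.direct_correct for r in subset),
        }
    return summary
```

Under this sketch, a large gap between `cot_acc` and `direct_acc` in the `with_equals` subset, together with a negligible gap in `without_equals`, would correspond to the pattern reported above.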
Introduction. Chain-of-thought (CoT) (Nye et al., 2022; Wei et al., 2022) has become a widely used prompting technique for eliciting reasoning from language models. CoT can provide human-readable explanations of how problems are solved (Joshi et al., 2023; Lanham et al., 2023), but most frequently it is invoked to improve an LLM’s ability to answer complex questions via intermediate computation (Madaan & Yazdanbakhsh, 2022; Wang et al., 2023a; Dziri et al., 2023). Current post-training schemes for LLMs heavily infuse CoT capabilities into models: systems like ChatGPT or Llama 3.1 default to CoT when given reasoning problems (OpenAI, 2023; Dubey et al., 2024). CoT has seen widespread usage, but it is most heavily explored in the domain of mathematical reasoning (Zhou et al., 2023a; Fu et al., 2023; Chae et al., 2024; Xu et al., 2024b; Qi et al., 2024). In fact, many “reasoning” methods for LLMs are evaluated only in the math domain; for instance, Lightman et al.
Discussion / Conclusion. Where is CoT helping and why? Our results showing CoT improvement for math and logic align well with early work on CoT for LLMs such as Scratchpads (Nye et al., 2022). As CoT gained popularity, its application broadened to tasks that canonically do not require multiple steps, where it can often yield small improvements over direct answering. We believe this led to the current prevailing sentiment that deliberation should improve performance on any task requiring some type of reasoning (our original claim from Section 2). However, our results show a clear separation between performance on non-symbolic and symbolic tasks. If, in theory, any question could benefit from deliberation, why does CoT benefit only the questions that can be solved through symbolic manipulation? Our results from Section 5 suggest that the primary benefit of CoT comes from the ability to execute symbolic steps and track their output.
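The distinction between planning and execution can be made concrete with a toy sketch. It assumes that a model has already produced a plan in the form of an arithmetic expression string, and it hands execution to a small deterministic evaluator rather than to further generated text. The function `execute_plan` and the expression format are hypothetical stand-ins; the experimental setup is not specified in this excerpt, and this is not the symbolic solver used in the paper.

```python
import ast
import operator

# Binary operators the toy executor accepts.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def execute_plan(plan: str):
    """Deterministically evaluate an arithmetic plan such as "(12 + 8) * 3 - 5".

    The plan is assumed to come from a model; only numbers, parentheses,
    unary minus, and the operators in _BIN_OPS are allowed.
    """

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        # Anything else (names, calls, attributes) is rejected.
        raise ValueError(f"unsupported element in plan: {type(node).__name__}")

    return ev(ast.parse(plan, mode="eval"))
```

In this framing, an error in the final answer can be attributed to the plan alone, since every intermediate value is computed exactly. With CoT, by contrast, the model must both plan and carry out each step in text.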