Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is the state-of-the-art method for many of these tasks. CoT uses language models to produce text that describes the reasoning, carries out the computation, and finally states the answer to a question. Here we propose ‘Program of Thoughts’ (PoT), which uses language models (mainly Codex) to generate text and programming language statements, and finally an answer. In PoT, the computation is delegated to a program interpreter, which executes the generated program, thus decoupling complex computation from reasoning and language understanding. We evaluate PoT on five math word problem datasets and three financial-QA datasets in both few-shot and zero-shot settings. We find that PoT has an average performance gain over CoT of around 12% across all datasets. By combining PoT with self-consistency decoding, we can achieve extremely strong performance on all the math datasets and financial datasets. All of our data and code will be released.
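To make the idea concrete, the following is a minimal sketch of the PoT pipeline combined with self-consistency voting; it is an illustration under stated assumptions, not the released implementation. The strings in `SAMPLED_PROGRAMS` are hypothetical stand-ins for programs a language model might generate for the question "What is 1000 dollars worth after 10 years at 5% annual compound interest?", and the names `run_program`, `majority_vote`, and the answer variable `ans` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical programs, standing in for several samples drawn from a
# language model for the same question. The third sample contains a
# reasoning error (simple interest instead of compound interest).
SAMPLED_PROGRAMS = [
    "principal = 1000\nrate = 0.05\nyears = 10\n"
    "ans = principal * (1 + rate) ** years",
    "principal = 1000\nrate = 0.05\nyears = 10\n"
    "ans = principal * (1 + rate) ** years",
    "principal = 1000\nrate = 0.05\nyears = 10\n"
    "ans = principal * (1 + rate * years)",
]


def run_program(source, answer_var="ans"):
    """Execute one generated program and return the value of `answer_var`.

    The arithmetic is done by the Python interpreter, not by the model.
    Note: exec() is not a security sandbox; emptying __builtins__ only
    keeps this toy example small. Returns None if execution fails.
    """
    namespace = {}
    try:
        exec(source, {"__builtins__": {}}, namespace)
    except Exception:
        return None
    return namespace.get(answer_var)


def majority_vote(answers, ndigits=2):
    """Self-consistency: return the most frequent answer after rounding."""
    valid = [round(a, ndigits) for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


answers = [run_program(p) for p in SAMPLED_PROGRAMS]
final_answer = majority_vote(answers)
print(final_answer)
```

In this sketch the model only has to write down the right expression; the exponentiation itself is handled by the interpreter, and the vote discards the one sample whose reasoning went wrong.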
Introduction. Numerical reasoning is a long-standing task in artificial intelligence. A surge of datasets has been proposed recently to benchmark deep-learning models’ capabilities to perform numerical/arithmetic reasoning. Some widely used benchmarks are based on math word problems (MWP) (Cobbe et al., 2021; Patel et al., 2021; Lu et al., 2022; Ling et al., 2017), where systems are supposed to answer math questions expressed in natural language. Besides MWP, some datasets also consider financial problems (Chen et al., 2021b; 2022; Zhu et al., 2021), where systems need to answer math-driven financial questions. Prior work (Ling et al., 2017; Cobbe et al., 2021) has studied how to train models from scratch or fine-tune models to generate intermediate steps to derive the final answer. Such methods are data-intensive, requiring a significant number of training examples with expert-annotated steps. Recently, Nye et al.
Discussion / Conclusion. In this work, we have verified that our prompting methods can work effectively on numerical reasoning tasks such as math or finance problem solving. We also study how to combine PoT with CoT to obtain the merits of both prompting approaches. We believe PoT is suitable for problems that require highly symbolic reasoning skills. For semantic reasoning tasks like commonsense reasoning (StrategyQA), we conjecture that PoT is not the best option. In contrast, CoT can solve a broader range of reasoning tasks. In this work, we investigate how to disentangle computation from reasoning in solving numerical problems. With ‘program of thoughts’ prompting, we are able to elicit LLMs’ abilities to generate accurate programs that express complex reasoning procedures, while allowing computation to be handled separately by an external program interpreter. This approach significantly boosts the performance of LLMs on several math datasets. We believe our work can inspire more work on combining symbolic execution with LLMs to achieve better performance on other symbolic reasoning tasks.