Zero-Shot Verification-guided Chain of Thoughts

Paper · arXiv 2501.13122 · Published January 21, 2025
Chain-of-Thought and Reasoning Methods

Previous works have demonstrated the effectiveness of Chain-of-Thought (COT) prompts and verifiers in guiding Large Language Models (LLMs) through the space of reasoning. However, most such studies either use a finetuned verifier or rely on manually handcrafted few-shot examples. In contrast, in this paper, we focus on LLM-based self-verification of self-generated reasoning steps via COT prompts in a completely zero-shot regime. To explore this setting, we design a new zero-shot prompt, which we call COT STEP, to aid zero-shot decomposition of reasoning steps, and we design two new zero-shot prompts for LLM-based verifiers. We evaluate the verifiers’ ability to classify the correctness of reasoning chains and explore different ways to use verifier scores in guiding reasoning for various mathematical and commonsense reasoning tasks with different LLMs.

Introduction. Large Language Models (LLMs) (Dubey et al., 2024; Brown et al., 2020) have revolutionized the field of NLP by enabling state-of-the-art (SOTA) performance in several tasks merely through smart prompting (Wei et al., 2022a; Bubeck et al., 2023). A critical landmark in the art of prompting for multi-step reasoning tasks is Chain-of-Thought (COT) prompting (Nye et al., 2022; Wei et al., 2022b), which elicits step-by-step reasoning chains from LLMs before they provide the answer to a given question. Building upon COT prompting, several recent works incorporate a verifier mechanism to improve LLMs’ performance. For instance, Cobbe et al. (2021); Li et al. (2023); Weng et al. (2023) use a verifier to evaluate the correctness of each reasoning step and score it, whereas Gandhi et al. (2023); Yao et al. (2023); Hao et al. (2023, 2024) use a verifier’s scores to do a tree-search over the space of reasoning steps. However, these works have several drawbacks including: 1.

Discussion / Conclusion. We have the following takeaways: (1) Prompts like PS+ or TAB COT are not necessarily systematically better than COT. (2) COT STEP offers an elegant zero-shot strategy to decompose reasoning steps with virtually no accuracy loss compared to COT. (3) The zero-shot COT prompt (COTR-prompt) is also useful for verification, particularly in the mathematical domain. (4) Zero-shot verifier scores do not particularly help in augmenting self-consistency. (5) Using zero-shot verifier scores to guide reasoning step search in a step-wise stochastic greedy manner can be helpful, but its benefit compared to plain COT disappears when using self-consistency. Beam search did not help either. (6) The verifier works better with COT STEP-based decomposition than with TAB COT.
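To make takeaways (4) and (5) concrete, the following is a minimal Python sketch of how verifier scores could enter answer aggregation and step selection. It is an illustration under stated assumptions, not the paper's implementation: the function names, the sum-of-scores weighting, and the softmax sampling rule with a `temperature` parameter are all hypothetical choices, and the answers and scores are assumed to come from an LLM and a zero-shot verifier that are not shown here.

```python
import math
import random
from collections import defaultdict


def majority_vote(answers):
    """Plain self-consistency: return the most frequent final answer."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def verifier_weighted_vote(answers, scores):
    """Self-consistency variant: each chain votes with its verifier score.

    Assumption: a chain's weight is simply its score; other weighting
    schemes are equally plausible.
    """
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


def stochastic_greedy_pick(candidates, scores, temperature=1.0, rng=random):
    """Sample one candidate next step, favoring higher verifier scores.

    Assumption: sampling probability is softmax(score / temperature).
    """
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical final answers from five sampled chains and their verifier scores.
answers = ["18", "20", "18", "20", "20"]
scores = [0.9, 0.2, 0.8, 0.3, 0.1]
print(majority_vote(answers))
print(verifier_weighted_vote(answers, scores))
```

In this toy input the two aggregation rules disagree, which is the situation where verifier weighting could matter; takeaway (4) reports that in the paper's experiments such weighting did not particularly help over plain self-consistency. In a step-wise search, `stochastic_greedy_pick` would be called once per reasoning step on the candidate continuations sampled at that step.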