Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
Language models can be prompted to reason through problems in a manner that significantly improves performance. However, why such prompting improves performance is unclear. Recent work showed that using logically invalid Chain-of-Thought (CoT) prompting improves performance almost as much as logically valid CoT prompting, and that editing CoT prompts to replace problem-specific information with abstract information or out-of-distribution information typically doesn’t harm performance. Critics have responded that these findings are based on too few and too easy tasks to draw meaningful conclusions. To resolve this dispute, we test whether logically invalid CoT prompts offer the same level of performance gains as logically valid prompts on the hardest tasks in the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). We find that the logically invalid reasoning prompts do indeed achieve similar performance gains on BBH tasks as logically valid reasoning prompts. We also discover that some CoT prompts used by previous works contain logical errors. This suggests that covariates beyond logically valid reasoning are responsible for performance improvements.
Introduction. Language models can perform significantly better when prompted in particular ways. For example, prompts that recommend or guide language models through step-by-step processing have been shown to significantly improve performance on question answering, conversational response generation and other tasks (Nye et al., 2021; Wei et al., 2022b; Jung et al., 2022; Kojima et al., 2022; Yao et al., 2023). These prompting techniques are especially effective on the hardest tasks (Suzgun et al., 2022) in the BIG-Bench benchmark (Srivastava et al., 2022), leading many to conclude that such techniques unlock emergent human-like reasoning abilities in large language models (Wei et al., 2022a). However, why such prompting strategies improve performance is unclear. Madaan & Yazdanbakhsh (2022) showed replacing problem-specific information in Chain-of-Thought (CoT) prompts with either abstract information or out-of-distribution information typically doesn’t harm CoT’s performance gains, and Wang et al.
Discussion / Conclusion. On the diverse and challenging BIG-Bench Hard tasks, we find that Chain-of-Thought prompting performs best on average, but logically invalid Chain-of-Thought prompting is close behind and outperforms Answer Only prompting. This demonstrates that completely illogical reasoning in the CoT prompts does not significantly harm the performance of the language model. Our findings suggest that valid reasoning in prompting is not the chief driver of performance gains, raising the question of what is. We note that there are complementary approaches towards achieving reasoning in language models such as enforcing valid reasoning in [...] Our work raises important questions for future work. Why are models robust to invalid CoT prompts? What features of the data or prompts result in the model outputting inconsistent or invalid outputs? Does increasing the degree of "incorrectness" or the number of incorrect prompts affect the model's sensitivity to invalid CoT? What other properties of the valid prompts is the model sensitive to?
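The three prompting conditions compared above (Answer Only, logically valid CoT, and logically invalid CoT) can be made concrete with a minimal sketch. Everything in it is illustrative and not taken from the paper: the `Exemplar` type, the `build_prompt` and `accuracy` helpers, the "Q:/A:" layout, the toy arithmetic exemplar, and exact-match scoring are all assumptions. In particular, the sketch assumes that an invalid rationale still ends in the correct final answer, so that only the reasoning differs between the two CoT conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One few-shot exemplar (hypothetical structure, not from the paper)."""
    question: str
    valid_rationale: str    # logically sound reasoning steps
    invalid_rationale: str  # deliberately flawed reasoning steps
    answer: str             # correct final answer, shared by all conditions


def build_prompt(exemplars: List[Exemplar], query: str, mode: str) -> str:
    """Assemble a few-shot prompt for one of three conditions.

    mode is 'answer_only', 'cot', or 'invalid_cot' (names are assumptions).
    """
    parts = []
    for ex in exemplars:
        if mode == "answer_only":
            body = f"A: {ex.answer}"
        elif mode == "cot":
            body = f"A: {ex.valid_rationale} So the answer is {ex.answer}."
        elif mode == "invalid_cot":
            body = f"A: {ex.invalid_rationale} So the answer is {ex.answer}."
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        parts.append(f"Q: {ex.question}\n{body}")
    # The query is appended last, with the answer slot left open for the model.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def accuracy(predictions: List[str], targets: List[str]) -> float:
    """Exact-match accuracy after whitespace stripping (an assumed metric)."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


# Toy exemplar: the invalid rationale contains false steps but keeps the answer.
toy = Exemplar(
    question="What is 2 + 3?",
    valid_rationale="2 plus 3 equals 5.",
    invalid_rationale="2 times 3 equals 9, and 9 minus 1 equals 5.",
    answer="5",
)

if __name__ == "__main__":
    for mode in ("answer_only", "cot", "invalid_cot"):
        print(f"--- {mode} ---")
        print(build_prompt([toy], "What is 4 + 1?", mode))
```

Under these assumptions, each condition's prompts would be sent to the same model and the extracted answers scored with `accuracy`, so that any gap between the `cot` and `invalid_cot` conditions isolates the contribution of logical validity.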