Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) perform better when they produce step-by-step, “Chain-of-Thought” (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model’s actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT’s performance boost does not seem to come from CoT’s added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
Introduction. It is often critical to understand why a large language model (LLM) provided the output it did, to understand the extent to which we can rely on its output (especially in high-stakes settings such as medicine; Gunning et al., 2019; Rudin, 2019). Many have claimed that the interpretability or explainability of LLMs is enhanced when they are prompted to generate step-by-step reasoning before giving an answer (Li et al., 2022; Wang et al., 2022; Wei et al., 2022; Yao et al., 2023b). Such claims only hold if the generated reasoning is faithful to the model’s true reasoning, meaning that it “accurately represents the reasoning process behind the model’s prediction” (Jacovi & Goldberg, 2020). However, LLM-generated reasoning has been shown to be unfaithful to the model’s true reasoning process in some cases (Turpin et al., 2023), raising the question of whether the stated reasoning is ever faithful.
Discussion / Conclusion. In this work, we investigate the faithfulness of reasoning samples produced by large language models using chain-of-thought prompting. We test various hypotheses of how chain of thought could provide unfaithful explanations of the model’s reasoning, and apply these tests across tasks and model sizes. Our experiments show large variation in the extent of post-hoc reasoning across tasks, and they provide evidence against the hypotheses that increased test-time compute or phrasing-encoded information are drivers of CoT improvement. We also see that the degree of post-hoc reasoning often shows inverse scaling, getting worse with increasingly capable models, suggesting that smaller models may be better to use if faithful reasoning is important. We hope that our metrics for evaluating CoT faithfulness open up avenues for increasing the faithfulness of CoT, building towards systems whose stated reasoning is trustworthy and verifiable.
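As a rough illustration of the intervention-based measurement described above, the sketch below computes how often a model's final answer changes when its chain of thought is modified. It is a minimal sketch under stated assumptions, not the evaluation code used in this work: the function name `answer_change_rate`, the `answer_fn(question, cot)` interface, the `intervene(cot)` callable, and the toy models are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions for illustration only):
#   AnswerFn:     (question, chain_of_thought) -> final answer string
#   Intervention: chain_of_thought -> modified chain_of_thought
AnswerFn = Callable[[str, str], str]
Intervention = Callable[[str], str]


def answer_change_rate(
    examples: Sequence[Tuple[str, str]],
    answer_fn: AnswerFn,
    intervene: Intervention,
) -> float:
    """Fraction of (question, cot) pairs whose answer changes after the intervention."""
    if not examples:
        raise ValueError("examples must be non-empty")
    changed = 0
    for question, cot in examples:
        original = answer_fn(question, cot)
        modified = answer_fn(question, intervene(cot))
        if original != modified:
            changed += 1
    return changed / len(examples)


# Toy stand-ins for a model, used only to exercise the function.
def cot_reader(question: str, cot: str) -> str:
    # Answers with the last word of the CoT, so it depends fully on the CoT.
    return cot.split()[-1]


def cot_ignorer(question: str, cot: str) -> str:
    # Always answers "A", so it ignores the CoT entirely.
    return "A"


def corrupt_last_word(cot: str) -> str:
    # Toy "mistake" intervention: replace the final word of the CoT with "Z".
    return cot.rsplit(" ", 1)[0] + " Z"


examples = [
    ("q1", "step one. so the answer is A"),
    ("q2", "step one. so the answer is B"),
]
reader_rate = answer_change_rate(examples, cot_reader, corrupt_last_word)
ignorer_rate = answer_change_rate(examples, cot_ignorer, corrupt_last_word)
```

Under these assumptions, a rate near 1 suggests the answer is conditioned on the CoT, while a rate near 0 is consistent with post-hoc reasoning, where the CoT has little influence on the answer. A real evaluation would replace the toy functions with calls to an actual model and a more realistic intervention.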