Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

Paper · arXiv 2307.02477 · Published July 5, 2023
Logical Reasoning and Internal Rules · Reasoning Critiques · LLM Failure Modes

The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on “counterfactual” task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.
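To make the default-versus-counterfactual comparison concrete, here is a minimal sketch under stated assumptions. It uses two-digit addition with base 10 as the default condition and base 9 as the counterfactual one. The helper names (`to_base`, `add_in_base`, `accuracy`) and the toy solver are hypothetical and exist only for this example. The solver is a stand-in for a system that has memorized the default procedure; it is not a language model, and the numbers it produces are not results from the paper.

```python
from typing import Callable, Iterable, Tuple

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def add_in_base(a: str, b: str, base: int) -> str:
    """Ground-truth addition: parse both operands in `base`, add, re-encode."""
    return to_base(int(a, base) + int(b, base), base)


def accuracy(
    solver: Callable[[str, str], str],
    problems: Iterable[Tuple[str, str]],
    base: int,
) -> float:
    """Fraction of problems where the solver matches the ground truth in `base`."""
    problems = list(problems)
    correct = sum(solver(a, b) == add_in_base(a, b, base) for a, b in problems)
    return correct / len(problems)


def base10_only_solver(a: str, b: str) -> str:
    """Toy stand-in (assumption): always adds as if the operands were base 10."""
    return str(int(a) + int(b))


# The same operand strings are valid in both bases (all digits are below 9).
problems = [("12", "34"), ("27", "15"), ("38", "41"), ("56", "23")]

default_acc = accuracy(base10_only_solver, problems, base=10)
counterfactual_acc = accuracy(base10_only_solver, problems, base=9)

print(f"default (base 10):        {default_acc:.2f}")
print(f"counterfactual (base 9):  {counterfactual_acc:.2f}")
print(f"gap:                      {default_acc - counterfactual_acc:.2f}")
```

In this toy setup the solver still gets the one carry-free problem right under base 9, so its counterfactual accuracy is above zero but well below its default accuracy. That is the shape of result the framework is designed to surface: a procedure tied to the default condition can look competent until the underlying assumption changes.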

Introduction. The striking empirical successes of language models (LMs) suggest that next-word prediction at scale may be a viable approach for distilling the knowledge embedded in large-scale text corpora into general-purpose interactive agents. LMs obtain impressive results on various NLP benchmarks (OpenAI, 2023; Anil et al., 2023; Anthropic, 2023; i.a.) and display surprising abilities that suggest a nontrivial understanding of the world (Bubeck et al., 2023). They have been shown to pass professional exams (Kung et al., 2023; Nori et al., 2023; Terwiesch, 2023; i.a.), exceed state-of-the-art methods on many traditional benchmarks (Sun et al., 2023; Sobania et al., 2023; Zhang et al., 2023a; Dhingra et al., 2023; i.a.), and surpass human performance on tasks that require seemingly nontrivial reasoning (Chowdhery et al., 2022; Hoffmann et al., 2022; Malinka et al., 2023; Guo et al., 2023; i.a.). Ideally, we expect a general-purpose LM to be able to generalize not only to unseen instances of known tasks, but to new tasks.

Discussion / Conclusion. Do humans also perform worse with unfamiliar counterfactual conditions? It is possible that humans may have lower performance under the counterfactual conditions with a fixed time budget, but not necessarily when given ample time to reason and revise. Analogous to the classic competence/performance distinction in linguistics (Chomsky, 1965, §1.1), we hypothesize that humans have the competence to generalize to new task conditions, even though it may sometimes require sufficient execution budget to realize it as robust performance. In fact, there is increasing evidence from cognitive science that human reasoning is scaffolded by rich causal models of the world (Pearl, 1988; Lake et al., 2017; Ullman and Tenenbaum, [...]

Through our counterfactual evaluation on 11 tasks, we identified consistent and substantial degradation of LM performance under counterfactual conditions. We attribute this gap to overfitting to the default task variants, and thus encourage future LM analyses to explicitly consider abstract task ability as detached from observed task performance, especially when these evaluated task variants might exist in abundance in the LM pretraining corpora.