Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

Paper · arXiv 2404.01869 · Published April 2, 2024
Reasoning Architectures · Philosophy and Subjectivity · Argumentation and Persuasion

Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs’ reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models’ reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models’ reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.

Introduction. Reasoning is an integral aspect of human intelligence and deliberate, rational thought (Holyoak & Morrison, 2005). It allows individuals to draw conclusions from available information and move beyond their current knowledge (Lohman & Lakin, 2011). As such, reasoning plays a fundamental role in problem-solving and decision-making, and has been a long-standing goal within the field of artificial intelligence (Robinson & Voronkov, 2001). In recent years, large language models have demonstrated remarkable performance on tasks that require reasoning (Bubeck et al., 2023; Wei et al., 2022; Kojima et al., 2022). This has sparked a vigorous debate about the extent to which these models possess reasoning abilities similar to humans (Mitchell & Krakauer, 2023; Mitchell, 2023; Borji, 2023).

Discussion / Conclusion. Despite the notable performance of large language models in prominent reasoning tasks (Bubeck et al., 2023; Fu et al., 2023), our review suggests that current models more closely resemble stochastic parrots (Bender et al., 2021) than systematic reasoners. As discussed in Section 3, we find that although many LLMs demonstrate proficiency in reasoning problems that align with their training data, the models’ reasoning behavior reveals significant conceptual errors and limitations in out-of-distribution scenarios. As highlighted by Mahowald et al. (2024), this suggests a limited functional linguistic competence in LLMs. It is likely that the apparent success of LLMs in reasoning tasks predominantly reflects their ability to memorize the extensive data they have been trained on (Wu et al., 2024; Dziri et al., 2023). Recent studies indicate that a substantial number of benchmark datasets have been leaked to current LLMs (Balloccu et al., 2024; Xu et al., 2024), raising concerns about the insights derived from their performance on such benchmarks. Therefore, we advocate for more nuanced analyses of the models’ reasoning behavior, particularly in novel scenarios that the models have not previously encountered.
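The contrast drawn above between problems that align with training data and out-of-distribution scenarios can be made concrete with a small illustration. The following is a minimal sketch, not a method taken from the survey: it assumes each evaluated item already carries a label saying whether it resembles known training data, and it reports accuracy separately for the two groups. The names `EvalItem`, `accuracy`, and `accuracy_gap` are hypothetical, and the toy results are made up for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    """One evaluated problem (hypothetical record layout)."""
    correct: bool          # did the model's final answer match the reference?
    in_distribution: bool  # assumed label: does the item resemble training data?


def accuracy(items):
    """Fraction of correct items, or None when the list is empty."""
    if not items:
        return None
    return sum(item.correct for item in items) / len(items)


def accuracy_gap(items):
    """Return (in-distribution accuracy, out-of-distribution accuracy, gap)."""
    in_dist = [item for item in items if item.in_distribution]
    out_dist = [item for item in items if not item.in_distribution]
    acc_in, acc_out = accuracy(in_dist), accuracy(out_dist)
    # The gap is only defined when both groups contain at least one item.
    gap = None if acc_in is None or acc_out is None else acc_in - acc_out
    return acc_in, acc_out, gap


# Made-up toy results: 3 of 4 familiar items correct, 1 of 4 novel items correct.
results = [
    EvalItem(True, True), EvalItem(True, True),
    EvalItem(True, True), EvalItem(False, True),
    EvalItem(True, False), EvalItem(False, False),
    EvalItem(False, False), EvalItem(False, False),
]

acc_in, acc_out, gap = accuracy_gap(results)
print(acc_in, acc_out, gap)  # 0.75 0.25 0.5
```

A single pooled accuracy over these eight toy items would be 0.5 and would hide the difference between the two groups. Even the split figures remain outcome-level metrics: they say nothing about the reasoning steps a model took, which is the kind of analysis the survey argues for.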