Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Paper · arXiv 2402.14848 · Published February 19, 2024
Logical Reasoning and Internal Rules

This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs’ recent advancements, the consistency of their performance across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs’ reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with LLMs’ performance in long-input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
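The padding idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' dataset construction code: the function names `build_padding` and `pad_sample`, the example facts, the filler sentences, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for this example.

```python
from itertools import cycle


def build_padding(filler_sentences, n_words):
    """Collect filler sentences until at least n_words words are gathered."""
    padding, count = [], 0
    if n_words <= 0:
        return padding
    for sentence in cycle(filler_sentences):
        padding.append(sentence)
        count += len(sentence.split())
        if count >= n_words:
            break
    return padding


def pad_sample(key_text, question, filler_sentences, target_words, location="after"):
    """Return one version of a sample, padded to roughly target_words words.

    Assumption: length is measured in whitespace-separated words,
    as a rough proxy for model tokens.
    """
    needed = target_words - len(key_text.split())
    padding = build_padding(filler_sentences, needed)
    if location == "before":
        parts = padding + [key_text]
    elif location == "after":
        parts = [key_text] + padding
    elif location == "around":
        half = len(padding) // 2
        parts = padding[:half] + [key_text] + padding[half:]
    else:
        raise ValueError(f"unknown location: {location}")
    return " ".join(parts) + "\n\nQuestion: " + question


# Hypothetical sample: the task-relevant facts stay fixed, only the padding changes.
key = "Ann is taller than Bob. Bob is taller than Carl."
question = "Is Ann taller than Carl?"
filler = ["The weather was mild that day.", "A train passed by the station."]

versions = {n: pad_sample(key, question, filler, n, "around") for n in (10, 50, 100)}
```

Because every version contains the same task-relevant text and the same question, any difference in a model's accuracy across the versions can be attributed to the added padding rather than to a change in the task.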

Introduction. Recent advancements in Large Language Models (LLMs) show impressive performance across a range of tasks (OpenAI, 2023; Anil et al., 2023; Jiang et al., 2024), including correctly answering complex questions that require multiple reasoning steps (Kojima et al., 2022; Wei et al., 2022). These models also claim to support increasingly longer inputs. This development underscores the need to examine their performance on the longer inputs they now technically support. A reasonable assumption is that support for long inputs would transfer across tasks, enabling a model that is adept at solving a task presented in a short prompt to perform the same task when it is embedded within a longer prompt. Does this assumption hold? Recent studies that benchmark models on tasks involving longer inputs, including reasoning tasks, indicate that models do often struggle with reasoning over long inputs (Shaham et al., 2023; Li et al., 2023; Bai et al., 2023).

Discussion / Conclusion. We study the effect of input length on the reasoning performance of current Large Language Models (LLMs). Our findings reveal a significant drop in performance with longer inputs, occurring well before the models’ maximum input-length capacity is reached. Our experiments relied on FLenQA, a dataset we constructed that allows us to isolate the length factor by adjusting the parts of the input that are irrelevant to the task. We show that regardless of how we adjust the samples, there is still a strong effect of length on reasoning performance. Finally, we identified specific failure modes, including difficulties in following extended instructions and biases towards less relevant information. These failure modes provide possible directions for future studies to address and rectify the weaknesses found in LLMs. In conclusion, our work indicates that evaluating a model’s performance based on a single input length does not provide a full picture, and more nuanced evaluation is required. We argue that for a model to be considered capable at long range, it must maintain its performance at any length it technically supports.
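One simple way to avoid judging a model at a single input length is to report accuracy separately for each length. The sketch below assumes results are available as `(input_length, is_correct)` pairs; the function name `accuracy_by_length` and the records are hypothetical placeholders chosen for illustration, not results or code from the paper.

```python
from collections import defaultdict


def accuracy_by_length(records):
    """Group (input_length, is_correct) pairs and return accuracy per length."""
    totals = defaultdict(lambda: [0, 0])  # length -> [correct answers, total answers]
    for length, correct in records:
        totals[length][0] += int(correct)
        totals[length][1] += 1
    return {length: hits / n for length, (hits, n) in sorted(totals.items())}


# Made-up placeholder records for illustration only; not data from the paper.
records = [
    (250, True), (250, True),
    (1000, True), (1000, False),
    (3000, False), (3000, False),
]

report = accuracy_by_length(records)
for length, acc in report.items():
    print(f"length={length:>5}  accuracy={acc:.2f}")
```

A per-length report of this kind makes a drop in accuracy visible even when the accuracy averaged over all lengths looks acceptable.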