Leveraging LLMs for KPIs Retrieval from Hybrid Long-Document: A Comprehensive Framework and Dataset
Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains unexplored. Hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply a naive split-recombination-based framework (SiReF) to enable LLMs to process HLDs, and we carry out experiments to analyze four important aspects of information extraction from HLDs. Our findings are: 1) an effective way to select and summarize the useful parts of an HLD; 2) a simple table serialization method is sufficient for LLMs to understand tables; 3) the naive SiReF adapts well to a wide range of complex scenarios; 4) useful prompt engineering techniques that enhance LLMs on HLDs. To address the scarcity of HLD datasets and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
Introduction. LLMs have exhibited remarkable capabilities in various natural language tasks, demonstrating their potential to comprehend and process intricate textual data (Wei et al., 2023a; Wang et al., 2023b; Zhou et al., 2022; Kojima et al., 2023). Beyond textual data, studies by Chen (2023) and Ye et al. (2023) highlight the effectiveness of LLMs in handling tabular data. However, despite this proven proficiency in understanding and analyzing textual and tabular data individually, research on the capacity of LLMs to tackle hybrid documents, which combine these two types of data, remains relatively scarce. Hybrid documents, which blend textual and tabular content, are widely used across diverse fields and are typically long. Extracting relevant information from hybrid long documents (HLDs) based on user-provided keywords is a crucial upstream task that supports various applications, such as question-answering systems (Gil, 2023), document classification (Mustafa et al., 2023), and information retrieval (da Silva et al., 2023).
Discussion / Conclusion. To enable LLMs to extract information from HLDs, we propose an Automated Information Extraction (AIE) framework comprising four modules: Segmentation, Retrieval, Summarization, and Extraction. To analyze the individual modules of AIE, we construct a dataset from publicly available financial reports, called Financial Reports Numerical Extraction (FINE). Experiments on FINE detail the impact of each module on AIE's ability to extract information from HLDs. We also validate the overall performance of AIE on three different domains: scientific papers, Wikipedia, and financial reports. Experimental results show that AIE significantly improves the ability of LLMs to handle HLDs.
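To make the four-module structure (Segmentation, Retrieval, Summarization, Extraction) concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the implementation evaluated in this work: all function names and prompts are hypothetical, segmentation counts whitespace-separated words rather than model tokens, retrieval uses simple keyword counts as a stand-in for a real retriever, and the LLM is abstracted as any callable that maps a prompt string to a completion string.

```python
import re
from typing import Callable, List, Optional

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


def segment(document: str, max_words: int = 200) -> List[str]:
    """Segmentation: split the document into chunks of at most max_words words.
    (Assumption: words approximate tokens; a real system would count tokens.)"""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def retrieve(segments: List[str], keyword: str, top_k: int = 3) -> List[str]:
    """Retrieval: keep the top_k segments that mention the keyword most often.
    (Assumption: keyword counting stands in for a real retriever.)"""
    key = keyword.lower()
    scored = [(seg.lower().count(key), idx, seg) for idx, seg in enumerate(segments)]
    hits = [item for item in scored if item[0] > 0]
    hits.sort(key=lambda item: (-item[0], item[1]))  # most mentions first, then document order
    return [seg for _, _, seg in hits[:top_k]]


def summarize(segments: List[str], keyword: str, llm: LLM) -> List[str]:
    """Summarization: ask the LLM to condense each retrieved segment."""
    return [llm(f"Summarize the facts about '{keyword}' in this passage:\n{seg}")
            for seg in segments]


def extract(summaries: List[str], keyword: str, llm: LLM) -> str:
    """Extraction: ask the LLM for the final value given the combined summaries."""
    notes = "\n".join(summaries)
    return llm(f"Using the notes below, give the value of '{keyword}'.\n{notes}")


def run_pipeline(document: str, keyword: str, llm: LLM,
                 max_words: int = 200, top_k: int = 3) -> Optional[str]:
    """Chain the four modules; return None when no segment mentions the keyword."""
    relevant = retrieve(segment(document, max_words), keyword, top_k)
    if not relevant:
        return None
    return extract(summarize(relevant, keyword, llm), keyword, llm)


def first_number_llm(prompt: str) -> str:
    """Toy stand-in for an LLM: returns the first number found after the
    prompt's instruction line. Only for exercising the pipeline offline."""
    body = prompt.split("\n", 1)[1] if "\n" in prompt else prompt
    match = re.search(r"\d[\d,.]*", body)
    return match.group(0) if match else ""


if __name__ == "__main__":
    doc = "filler " * 300 + "Total revenue was 1234 million. " + "filler " * 300
    print(run_pipeline(doc, "revenue", first_number_llm, max_words=50))
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model or API; swapping `first_number_llm` for a wrapper around a real model would leave the four modules unchanged.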