TnT-LLM: Text Mining at Scale with Large Language Models

Paper · arXiv 2403.12173 · Published March 18, 2024

Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale.

Introduction. Text mining is the process of extracting useful information and insights from a large collection of textual data [10, 27]. Two central and interrelated tasks in text mining are taxonomy generation, which involves finding and organizing a set of structured, canonical labels that describe aspects of the corpus, and text classification, or the labeling of instances in the corpus using said taxonomy. Many use cases of interest to practitioners can be framed as the sequential application of these two tasks, especially when the label space is not well-defined or when exploring a new corpus: For example, sentiment analysis consists of devising a sentiment taxonomy (e.g., “happy”, “sad”) and classifying text content (e.g., social media posts, product reviews) with labels in this taxonomy. Likewise, intent detection consists of defining a set of intents (e.g., “book a flight”, “buy a product”) and classifying text content (e.g., chatbot transcripts, search queries) with the intent labels.
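
The sequential framing above (first derive a taxonomy, then classify against it) maps onto the two phases described in the abstract. The following is a minimal sketch of that flow, not the authors' implementation. Everything named here is an assumption made for illustration: the `toy_llm` keyword stub stands in for a real LLM call, the prompt wording and batch-wise refinement loop are invented, and `TinyNaiveBayes` is just one possible "lightweight supervised classifier".

```python
import math
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

# Hypothetical stand-in type for an LLM call: prompt string in, response string out.
LLM = Callable[[str], str]


def generate_taxonomy(texts: List[str], llm: LLM, batch_size: int = 2) -> List[str]:
    """Phase 1 sketch: ask the LLM to propose, then refine, a label list batch by batch."""
    taxonomy: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        prompt = (
            "Current labels: " + ", ".join(taxonomy) + "\n"
            "Texts:\n" + "\n".join(batch) + "\n"
            "Return an updated comma-separated label list."
        )
        taxonomy = [label.strip() for label in llm(prompt).split(",") if label.strip()]
    return taxonomy


def pseudo_label(texts: List[str], taxonomy: List[str], llm: LLM) -> List[Tuple[str, str]]:
    """Phase 2 sketch: use the LLM as a labeler; drop answers outside the taxonomy."""
    pairs = []
    for text in texts:
        prompt = f"Labels: {', '.join(taxonomy)}\nText: {text}\nAnswer with one label."
        answer = llm(prompt).strip()
        if answer in taxonomy:
            pairs.append((text, answer))
    return pairs


class TinyNaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing, trained on pseudo labels."""

    def fit(self, texts: List[str], labels: List[str]) -> "TinyNaiveBayes":
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        n_docs = sum(self.label_counts.values())

        def score(label: str) -> float:
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.label_counts[label] / n_docs)
            return prior + sum(
                math.log((counts[w] + 1) / total) for w in text.lower().split()
            )

        return max(self.label_counts, key=score)


def toy_llm(prompt: str) -> str:
    """Keyword stub standing in for a real LLM; it does not model real LLM behavior."""
    if "Answer with one label" in prompt:
        text = prompt.split("Text: ")[1].split("\n")[0].lower()
        return "book a flight" if "flight" in text else "buy a product"
    return "book a flight, buy a product"


# Hypothetical intent-detection corpus, echoing the intent labels used as examples above.
corpus = [
    "I need a flight to Paris",
    "Please order this laptop for me",
    "Find a cheap flight tomorrow",
    "I want to purchase new headphones",
]
taxonomy = generate_taxonomy(corpus, toy_llm)
pairs = pseudo_label(corpus, taxonomy, toy_llm)
clf = TinyNaiveBayes().fit([t for t, _ in pairs], [l for _, l in pairs])
prediction = clf.predict("flight to Tokyo please")
```

In this sketch the LLM is only called while building the taxonomy and the pseudo-labeled training set. At inference time only the small classifier runs, which is the property that makes serving at scale inexpensive.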

Discussion / Conclusion. This work has the potential to significantly influence both research on and application of AI technologies in text mining. Our framework has demonstrated the ability to use LLMs as taxonomy generators, as well as data labelers and evaluators. These automations could lead to significant efficiency gains and cost savings for a variety of domains and applications that rely on understanding, structuring, and analyzing massive volumes of unstructured text. They could also broadly democratize the process of mining knowledge from text, empowering non-expert users and enterprises to interact with and interpret their data through natural language, thereby leading to better insights and data-driven decision making for a range of industries and sectors. Additionally, our framework and research findings relate to other work that leverages LLMs for taxonomy creation and text clustering, and offer important empirical lessons for the efficient use of instruction-following models in these scenarios.