Do Large Language Models Understand Conversational Implicature -- A case study with a Chinese sitcom

Paper · arXiv 2404.19509 · Published April 30, 2024
Philosophy and Subjectivity · NLP and Linguistics

Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom My Own Swordsman. It includes 200 carefully handcrafted questions, each annotated with the Gricean maxims that have been violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions. CausalLM follows GPT-4 with an accuracy of 78.5%. Other models, including GPT-3.5 and several open-source models, achieve lower accuracy, ranging from 20% to 60% on multiple-choice questions. Human raters were asked to rate the explanations of the implicatures generated by LLMs on their reasonability, logic and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation.

Introduction. The complexity of communication is largely epitomized by indirect, or non-literal, utterances. A common instance is hinting at a busy schedule as a polite refusal to engage in an unwanted activity. How such implied meaning is understood in human communication has long been a key subject of investigation in pragmatics research (Grice, 1975; Searle et al., 1980; Brown & Levinson, 1987; Wilson & Sperber, 2006). Evaluating the pragmatic understanding ability of large language models (LLMs) has drawn considerable attention in recent years as LLMs show remarkable ability for language understanding. Recent studies have evaluated LLMs’ pragmatic reasoning in multiple aspects, including scalar inference (Hu et al., 2023b), discourse connectives (Pandia et al., 2021), gradable adjectives (Lipkin et al., 2023) and conversational implicatures (Qiu et al., 2023; Kim et al., 2023; Ruis et al., 2022; Hu et al., 2023a; Zheng et al., 2021). However, the above-mentioned evaluations are primarily in English, leaving a gap for pragmatic understanding in other languages.

Discussion / Conclusion. Our results from Experiment 1 show that the performance of GPT-4 on our proposed benchmark is on par with humans, while other models are at least 15 points behind (including GPT-3.5-turbo). This suggests that while in principle pragmatic implicatures can be acquired by arguably the best LLMs at the moment, it is a non-trivial task for other LLMs. Results from Experiment 1 also reveal no significant by-maxim variance in either human or model accuracy (see Figure 2). This differs from the results of previous work on human processing of implicatures (Engelhardt et al., 2006; Rubio-Fernandez, 2019; Okanda et al., 2015; Panzeri & Foppolo, 2021), which demonstrate that humans sanction infringements of the maxims in different ways, being less sensitive to the violation of the maxim of quantity than to others, leading to more processing difficulty for this maxim.
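The by-maxim comparison discussed above amounts to grouping multiple-choice accuracy by the maxim each item is annotated with. The following is a minimal sketch of one way such a breakdown could be computed. The record layout (`maxims`, `gold`, `pred`), the option labels and the four toy items are hypothetical placeholders for illustration, not data or code from SwordsmanImp, and the choice to count an item once under every maxim it is tagged with is an assumption.

```python
from collections import defaultdict

# Hypothetical toy records: field names and values are illustrative only
# and are NOT taken from the SwordsmanImp dataset.
items = [
    {"maxims": ["quantity"], "gold": "A", "pred": "A"},
    {"maxims": ["quality"], "gold": "B", "pred": "C"},
    {"maxims": ["relevance", "manner"], "gold": "D", "pred": "D"},
    {"maxims": ["quantity", "quality"], "gold": "C", "pred": "C"},
]


def accuracy_by_maxim(records):
    """Return {maxim: accuracy}.

    Assumption: an item tagged with several maxims is counted once
    under each of them.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        hit = record["pred"] == record["gold"]
        for maxim in record["maxims"]:
            total[maxim] += 1
            correct[maxim] += int(hit)
    return {maxim: correct[maxim] / total[maxim] for maxim in total}


def overall_accuracy(records):
    """Fraction of items whose predicted option matches the gold option."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


per_maxim = accuracy_by_maxim(items)
overall = overall_accuracy(items)

for maxim, acc in sorted(per_maxim.items()):
    print(f"{maxim}: {acc:.2f}")
print(f"overall: {overall:.2f}")
```

Because items can carry more than one maxim label, the per-maxim totals in this sketch need not sum to the number of items, so per-maxim accuracies are not a simple partition of the overall score.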