Empathy Through Multimodality in Conversational Interfaces
Abstract. Agents are among the fastest-emerging applications of Large Language Models (LLMs) and Generative AI, and their effectiveness hinges on multimodal capabilities for navigating complex user environments. Conversational Health Agents (CHAs), a prime example, are redefining healthcare by offering nuanced support that goes beyond textual analysis to incorporate emotional intelligence. This paper introduces an LLM-based CHA engineered for rich, multimodal dialogue, particularly for mental health support. It interprets and responds to users’ emotional states by analyzing multimodal cues, delivering contextually aware and empathetically resonant verbal responses. Our implementation leverages the openCHA framework, and our evaluation uses neutral prompts expressed in three emotional tones: sadness, anger, and joy. We evaluate the consistency and repeatability of the planning capability of the proposed CHA. Furthermore, human evaluators assess the CHA’s empathic delivery, and the findings reveal a striking concordance between the CHA’s outputs and the evaluators’ assessments.
Introduction. Human conversations transcend mere words, orchestrated as a multimedia experience where tonal inflections, facial dynamics, and gestural semantics are interwoven. These nonverbal cues enrich the emotional and contextual semantics of our exchanges, serving a role analogous to metadata in digital content. Echoing Socrates’ ancient apprehensions about written language, we recognize the imperative to resurrect the soul of conversation within our digital interactions. The advent of mobile technology, replete with sophisticated biometric sensors and capabilities for environmental data capture, has ushered in a transformative shift in communication. Physiological signatures, measured through technologies such as photoplethysmography, accelerometers, and transdermal optical imaging, now provide integral data streams, enriching the field of emotional analytics. This integration of multimodal sensory data with computational intelligence, especially when interfaced with cutting-edge Generative AI and Large Language Models (LLMs), marks the dawn of a new era in human-computer interaction.
Discussion / Conclusion. Our exploration of multimodal CHAs using LLMs offers a promising avenue towards revolutionizing human-computer interaction. In this paper, we introduced an LLM-powered multimodal CHA tailored for in-depth dialogues within health support environments. The agent interprets emotional cues from speech patterns to provide context-aware and empathetic verbal responses. Employing the openCHA framework, we integrated an LLM with speech-to-text, speech emotion detection, Internet search, and text-to-speech tools. Our evaluation was conducted in two stages. We first assessed the planning capabilities of the agent. Our findings showed that the planner achieved 89% accuracy in identifying the emotional state from the voice and retrieving information pertinent to the user’s query. It also achieved 61% accuracy in correctly calling the Internet search tool based on the emotional state. We then evaluated the responses in terms of empathy.
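As an illustration of how a planning accuracy of this kind could be computed, the following minimal Python sketch scores repeated planner runs against an expected tool sequence. It rests on several assumptions: the tool identifiers are hypothetical names for the four tool types listed above, accuracy is taken to mean an exact match of the tool sequence (the text does not give the precise definition), and the sample runs are illustrative data rather than results from our evaluation. It does not use the openCHA API.

```python
from typing import Sequence

# Hypothetical identifiers for the four tool types named in the text.
EXPECTED_PLAN = (
    "speech_to_text",
    "speech_emotion_detection",
    "internet_search",
    "text_to_speech",
)


def plan_accuracy(
    runs: Sequence[Sequence[str]],
    expected: Sequence[str] = EXPECTED_PLAN,
) -> float:
    """Return the fraction of planner runs whose tool sequence matches `expected`.

    Assumption: a run counts as correct only on an exact, in-order match.
    """
    if not runs:
        raise ValueError("at least one planner run is required")
    matches = sum(1 for run in runs if tuple(run) == tuple(expected))
    return matches / len(runs)


# Illustrative runs only (not data from the paper): the third run skips the search tool.
runs = [
    EXPECTED_PLAN,
    EXPECTED_PLAN,
    ("speech_to_text", "speech_emotion_detection", "text_to_speech"),
    EXPECTED_PLAN,
]

print(f"{plan_accuracy(runs):.0%}")  # prints "75%"
```

Repeating the same prompt several times and scoring each run in this way gives one simple measure of both the consistency and the repeatability of the planner; a looser criterion, such as checking only whether a particular tool was called, would need a different matching rule.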