Large Multimodal Agents: A Survey
Large language models (LLMs) have achieved strong performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to those of humans. Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review collaborative frameworks that integrate multiple LMAs to enhance their collective efficacy. One of the critical challenges in this field is the diversity of evaluation methods used across existing studies, which hinders effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons.
Introduction. An agent is a system capable of perceiving its environment and making decisions based on these perceptions to achieve specific goals [56]. While proficient in narrow domains, early agents [35, 50] often lack adaptability and generalization, highlighting a significant disparity with human intelligence. Recent advancements in large language models (LLMs) have begun to bridge this gap by strengthening agents' capabilities in command interpretation, knowledge assimilation [36, 78], and mimicry of human reasoning and learning [21, 66]. LLM-powered agents use LLMs as their primary decision-making tool and are further enhanced with critical human-like features, such as memory. This enhancement allows them to handle a variety of natural language processing tasks and interact with the environment using language [40, 38]. However, real-world scenarios often involve information that extends beyond text, encompassing multiple modalities, with a significant emphasis on the visual aspect.
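To make the agent definition at the start of this introduction concrete, the following is a minimal sketch of a perceive, decide, and act loop with a simple memory. It is an illustration only, not an implementation from any of the surveyed systems: the names `Agent`, `perceive`, `step`, and `rule_based_policy` are hypothetical, and the decision function is a plain Python rule standing in for an LLM call so that the sketch runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Toy agent: perceive -> decide -> act, with a simple memory.

    `decide` is a hypothetical stand-in for the LLM; it is an ordinary
    function here so that the example is self-contained.
    """
    decide: Callable[[str, List[str]], str]
    memory: List[str] = field(default_factory=list)

    def perceive(self, observation: str) -> str:
        # A text-only agent uses the observation as-is. A multimodal agent
        # would first convert images, audio, etc. into a usable form here.
        return observation.strip()

    def step(self, observation: str) -> str:
        percept = self.perceive(observation)
        action = self.decide(percept, self.memory)
        # Record the interaction so later decisions can depend on it.
        self.memory.append(f"{percept} -> {action}")
        return action


def rule_based_policy(percept: str, memory: List[str]) -> str:
    """Hypothetical decision function used in place of an LLM."""
    if any(entry.startswith(percept + " -> ") for entry in memory):
        return "skip (already handled)"
    return f"handle: {percept}"


agent = Agent(decide=rule_based_policy)
first = agent.step("open the door")
second = agent.step("open the door")
```

In this sketch the second call returns a different action than the first because the decision function consults the memory, which is the role memory plays in the description above. Swapping `rule_based_policy` for a function that queries a model would leave the loop itself unchanged.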
Discussion / Conclusion. In this survey, we provide a thorough overview of the latest research on multimodal agents driven by LLMs (LMAs). We start by introducing the core components of LMAs (i.e., perception, planning, action, and memory) and classify existing studies into four categories. Subsequently, we compile existing methodologies for evaluating LMAs and devise a comprehensive evaluation framework. Finally, we spotlight a range of current and significant application scenarios for LMAs. Despite the notable progress, this field still faces many unresolved challenges, and there is considerable room for improvement. We close by highlighting several promising directions based on the reviewed progress:
• On frameworks: Future LMA frameworks may evolve from two distinct perspectives. From the viewpoint of a single agent, development could progress towards the creation of a more unified system.