From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Paper · arXiv 2401.15071 · Published January 26, 2024

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI’s GPT-4 and Google’s Gemini, have been deployed. This paper strives to enhance understanding of this gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini and 6 open-source LLMs and MLLMs. Overall, we evaluate 232 manually designed cases, and the qualitative results are then summarized into 12 scores (i.e., 4 modalities × 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.

Introduction. Recent powerful Large Language Models (LLMs) [14, 57, 40, 54] have revolutionized the way machines process text. By leveraging LLMs as universal task interfaces, Multi-modal Large Language Models (MLLMs) [41, 53, 36, 2, 68, 34] have shown impressive abilities to interact with multi-modal content (such as images, videos, code, and text), and are expected to address more complex multi-modal tasks and to be deployed in myriad multi-modal applications. As the frontrunners, MLLMs like GPT-4 [41] from OpenAI and the recently released Gemini [53] by Google have set new benchmarks in multi-modal capabilities. Moreover, a number of open-source MLLMs have also been developed by the industrial and academic communities, many of which claim performance comparable to the aforementioned proprietary models. Unfortunately, the performance of recent MLLMs, whether open-source or closed-source, is still not reliable enough to meet the expectations of the broad public.

