Grounding Multilingual Multimodal LLMs With Cultural Knowledge

Paper · arXiv 2508.07414 · Published August 10, 2025

Multimodal Large Language Models (MLLMs) excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities and generate synthetic multilingual visual question answering (VQA) data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of +5.0% without degrading results on mainstream vision–language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.

Introduction. Despite being trained on billions of image–text pairs, today’s multimodal large language models (MLLMs) remain biased towards English and Western-centric training data (Ramaswamy et al., 2023; Vayani et al., 2024a; Yue et al., 2024; Liu et al., 2025). As a result, MLLMs often excel on high-resource languages, but even state-of-the-art models overlook or misinterpret non-Western cultural cues, especially long-tail entities (Liu et al., 2021b; Blasi et al., 2022; Ahia et al., 2023; AlKhamissi et al., 2024; Romero et al., 2024; Ananthram et al., 2024). Simply translating English data or increasing the size of the training corpora does not solve this problem. Translated data remain “Anglo-centric”, and naively scaling up training corpora will not change the underlying biased distribution of the data (Yu et al., 2022; Tao et al., 2024; Gallegos et al., 2024). Recent efforts emphasize the need for targeted, multicultural data curation to bridge this gap (Yue et al., 2024; Liu et al., 2025).

Discussion / Conclusion. By the end of training, CulturalPangea not only acquires markedly stronger culturally grounded capabilities (averaging +5.0% improvement across cultural benchmarks) but also achieves slightly higher overall VQA accuracy than the baseline. This exemplifies a difficult-to-achieve equilibrium between new specialization and retained generalization, and highlights the promise of CulturalGround and our training approach. We present a data-centric approach for mining culturally grounded multimodal data from public knowledge bases. CulturalPangea, a model trained on the resulting dataset, demonstrates the effectiveness of the approach and outperforms prior open-source MLLMs on numerous cultural benchmarks such as CVQA, ALMBench, XM100, and MERLIN while preserving general and multilingual vision–language skills. Our findings show that deliberately curating culturally rich data is essential for creating more inclusive multimodal LLMs.

