ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

Paper · arXiv 2507.16403 · Published July 22, 2025
Multimodal Models · LLM Evaluations and Benchmarks · Knowledge Graphs · Question Answering and Search

In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.

Introduction. In recent years, significant advancements have been made in the field of Visual Question Answering (VQA) on standard VQA datasets [1, 4, 10, 37]. Initially, these datasets focused mainly on simple questions related to object identification and attributes, such as name, shape, color, and position. Towards the goal of general-purpose artificial intelligence, VQA models are expected to answer questions that require a deeper understanding of the world, fine-grained visual recognition, and multi-step reasoning. Recently, several additional VQA datasets [7, 11, 14, 22, 28, 32] have been introduced to challenge VQA systems to handle more complex questions. However, these datasets have limitations: some are entirely synthetic, while others rely heavily on manual annotation effort.

Discussion / Conclusion. We have proposed a novel VQA dataset in which external knowledge is required to answer questions. Our dataset construction framework is cost-effective, scalable, and re-