The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
Abstract—Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions. First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem, a novel perspective in SR research. Third, we construct a large-scale multi-label evaluation benchmark for SR models and develop GE-VerbMLP, which combines graph neural networks (GNNs) and adversarial training for robust performance.
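To make the SPMLL setting concrete, the following is a minimal sketch of one simple baseline for it, in which every verb that was not annotated is treated as a negative, optionally with a reduced weight. It is an illustrative assumption and not the training objective proposed in this paper; the function name `spmll_assume_negative_loss`, the `neg_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def spmll_assume_negative_loss(logits, observed_verb, neg_weight=1.0):
    """Per-image binary cross-entropy with a single observed positive verb.

    logits        -- one raw score per verb category
    observed_verb -- index of the only annotated (positive) verb
    neg_weight    -- hypothetical down-weighting of unobserved verbs, which
                     may in fact be valid descriptions of the image
    """
    loss = 0.0
    for k, z in enumerate(logits):
        if k == observed_verb:
            # -log(sigmoid(z)) for the annotated verb
            loss += softplus(-z)
        else:
            # -log(1 - sigmoid(z)) for verbs assumed to be negative
            loss += neg_weight * softplus(z)
    return loss / len(logits)


# Hypothetical scores for three verbs; index 1 is plausible but unannotated.
example_logits = [2.0, 1.5, -3.0]
full_penalty = spmll_assume_negative_loss(example_logits, observed_verb=0)
soft_penalty = spmll_assume_negative_loss(example_logits, observed_verb=0, neg_weight=0.1)
```

With `neg_weight=1.0` the model is penalized heavily for scoring the plausible but unannotated verb highly. Lowering the weight softens that penalty, which is the kind of trade-off an SPMLL objective has to manage.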
Introduction. Modern multimedia applications increasingly demand systems that can understand images at both the object level (recognizing individual entities) and the event level (comprehending interactions and activities). Situation Recognition (SR) has emerged as a crucial task addressing this need by extracting structured semantic representations from images [25], [26]. Generally, SR [25], [26] can be decomposed into three interrelated sub-tasks: verb classification, semantic role labeling, and semantic role grounding. Fig. 1 shows a typical example of SR. Given an image, verb classification identifies the type of visual event taking place (known as the “verb”). Semantic role labeling then assigns noun phrases to the event-specific arguments (known as “roles”). Finally, semantic role grounding regresses a bounding box for the visual object that fills each role.
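The three sub-tasks can be read as filling in one structured record per image. The sketch below shows one possible representation, assuming pixel-coordinate boxes in (x_min, y_min, x_max, y_max) order; the class names `RoleFiller` and `SituationFrame`, and the example verb, roles, and coordinates, are hypothetical and not taken from any dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class RoleFiller:
    """Output of semantic role labeling and grounding for one role."""
    noun: str                          # noun phrase assigned to the role
    box: Optional[BoundingBox] = None  # None when the entity is not localized


@dataclass
class SituationFrame:
    """Structured summary of an image: one verb plus its role fillers."""
    verb: str                                            # verb classification
    roles: Dict[str, RoleFiller] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Names of the roles that have a bounding box, in insertion order."""
        return [name for name, filler in self.roles.items() if filler.box is not None]


# Hypothetical example frame.
frame = SituationFrame(
    verb="riding",
    roles={
        "agent": RoleFiller("woman", (34.0, 20.0, 180.0, 240.0)),
        "vehicle": RoleFiller("bicycle", (40.0, 120.0, 210.0, 300.0)),
        "place": RoleFiller("street"),  # scene-level role without a box
    },
)
```

Because the set of roles depends on the verb, an ambiguous or incorrect verb prediction affects every downstream field of the frame.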
Discussion / Conclusion. This paper presents a comprehensive study of the ambiguity problem in verb classification for situation recognition (SR). Through extensive empirical analysis, we show that current single-label classification formulations fail to capture the inherent semantic overlap between verb categories, resulting in suboptimal performance and evaluation results. Our core insight is that verb classification should be reformulated as a multi-label learning problem to better reflect the nature of visual event recognition. To address the practical challenge of obtaining full multi-label annotations, we propose to formulate verb classification as a single positive multi-label learning (SPMLL) problem, a new perspective in SR research. Our contributions include: 1) analyzing verb ambiguity in depth through embeddings and manual annotations; 2) creating a large-scale multi-label evaluation benchmark to enable proper evaluation of SR models; and 3) developing GE-VerbMLP, which combines GNNs and adversarial training for robust performance.
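The effect of single-label evaluation on an ambiguous image can be illustrated with a small sketch. It assumes a simple top-k criterion in which a prediction counts as correct if any of the k highest-scoring verbs is in the set of acceptable verbs. This is not necessarily the metric used in the benchmark, and the verb names and scores are hypothetical.

```python
def top_k_hit(scores, acceptable, k=1):
    """True if any of the k highest-scoring verb indices is acceptable."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return any(i in acceptable for i in ranked[:k])


def top_k_accuracy(all_scores, all_acceptable, k=1):
    """Fraction of images whose top-k predictions contain an acceptable verb."""
    hits = sum(top_k_hit(s, a, k) for s, a in zip(all_scores, all_acceptable))
    return hits / len(all_scores)


# Hypothetical verbs: 0 = "riding", 1 = "biking", 2 = "swimming".
scores = [[0.40, 0.55, 0.05]]   # the model prefers "biking"
single_label = [{0}]            # only "riding" was annotated
multi_label = [{0, 1}]          # both verbs judged valid for the image

single_acc = top_k_accuracy(scores, single_label)
multi_acc = top_k_accuracy(scores, multi_label)
```

In this toy case the same prediction is counted as an error under the single-label ground truth and as correct under the multi-label ground truth, which is the kind of evaluation discrepancy described above.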