Learning Human-Object Interaction as Groups

Paper · arXiv 2510.18357 · Published October 21, 2025
Action Models · World Models · Design Frameworks

Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (i.e., multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features.
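The grouping idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the learnable proximity estimator is replaced here by a fixed, hand-written distance rule, and the names `box_spatial_features`, `group_by_proximity`, `grouped_self_attention`, and the `threshold` parameter are all hypothetical. The sketch only shows the general pattern of deriving spatial features from bounding boxes, clustering entities, and restricting self-attention to members of the same group.

```python
import numpy as np


def box_spatial_features(boxes):
    """Convert (N, 4) boxes [x1, y1, x2, y2] into (N, 4) features [cx, cy, w, h]."""
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return np.stack([cx, cy, w, h], axis=1)


def group_by_proximity(boxes, threshold):
    """Assumed stand-in for a learned proximity estimator.

    Two entities are linked when their center distance, normalized by their
    average box size, falls below `threshold`. Groups are the connected
    components of the resulting graph.
    """
    feats = box_spatial_features(boxes)
    centers = feats[:, :2]
    size = np.sqrt(feats[:, 2] * feats[:, 3])
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    avg_size = (size[:, None] + size[None, :]) / 2
    adjacency = dist / avg_size < threshold

    labels = -np.ones(len(feats), dtype=int)
    next_label = 0
    for start in range(len(feats)):
        if labels[start] >= 0:
            continue
        # Depth-first traversal assigns one label per connected component.
        labels[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in np.flatnonzero(adjacency[node]):
                if labels[neighbor] < 0:
                    labels[neighbor] = next_label
                    stack.append(neighbor)
        next_label += 1
    return labels


def grouped_self_attention(x, labels):
    """Single-head dot-product self-attention masked to same-group entities."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    same_group = labels[:, None] == labels[None, :]
    scores = np.where(same_group, scores, -np.inf)
    # Each row contains its own diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


# Two nearby boxes on the left, two nearby boxes far away on the right.
boxes = [[0, 0, 10, 10], [5, 0, 15, 10], [100, 100, 110, 110], [104, 100, 114, 110]]
labels = group_by_proximity(boxes, threshold=1.5)
features = np.random.default_rng(0).normal(size=(4, 8))
context, weights = grouped_self_attention(features, labels)
```

In this toy input the first two boxes form one group and the last two another, so the attention weights between the two groups are exactly zero. A learned estimator would replace the fixed threshold rule with a proximity score trained jointly with the rest of the model.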

Introduction. Human-Object Interaction Detection (HOI-DET), a critical pillar of visual relationship understanding, treats entities (i.e., humans and objects) as basic building blocks and relationships as the connective glue that weaves them into meaningful patterns. Early works [1, 2] typically recognize HO-pairs, cropped from natural images using manually obtained bounding boxes, as composites (i.e., visual phrases) in isolation. Recent efforts [3, 4] extend HOI reasoning to real-world scenarios involving multiple entities with complex relational structures. Building upon object detection frameworks, HOI-DET methods have evolved alongside advances in detector architectures, from Faster R-CNN [5] to DETR-like variants [6, 7]. As a semantic interpretation task, HOI-DET can also benefit from large visual-linguistic models (e.g., CLIP [8] and BLIP [9]) pre-trained on extensive image-text corpora. Despite architectural innovations and multi-modal knowledge transfer, the core challenge remains unchanged: ❶ how to structure and reason about relationships among entities?

Discussion / Conclusion. Future Direction. Current HOI benchmarks involve limited entities within small fields of view, which has led mainstream HOI-DET [18] methods to adopt global relational modeling. However, when applied to larger scenarios (e.g., gigapixel-level crowd images [61]), this paradigm incurs high computational cost and information redundancy. A practical solution is to confine relational modeling to local regions, which makes the principles proposed in this paper highly applicable to real-world settings. Limitation. A limitation of our method is that it focuses on interaction reasoning without integrating the proposed mechanisms into the object detection branch. We find this direction intriguing, yet it lies beyond the scope of the present study. Moreover, like other HOI-DET [18] models, GroupHOI is trained with limited label diversity, which poses challenges for generalization to in-the-wild scenarios.
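The cost argument in the Future Direction paragraph can be made concrete with a small counting sketch. It assumes, purely for illustration, that relational modeling cost scales with the number of pairwise attention scores; the function name `attention_pair_counts` and the entity and group sizes are hypothetical and not taken from the paper.

```python
from collections import Counter


def attention_pair_counts(group_labels):
    """Return (global, grouped) counts of pairwise attention scores.

    Global attention scores every ordered pair of entities (N * N).
    Group-restricted attention only scores pairs inside the same group,
    giving the sum of squared group sizes.
    """
    n = len(group_labels)
    global_pairs = n * n
    grouped_pairs = sum(size * size for size in Counter(group_labels).values())
    return global_pairs, grouped_pairs


# Illustrative crowd scene: 1000 entities split evenly into 100 local groups.
labels = [i // 10 for i in range(1000)]
global_pairs, grouped_pairs = attention_pair_counts(labels)
print(global_pairs, grouped_pairs)  # 1000000 10000
```

Under these assumed numbers, restricting attention to local groups scores 100 times fewer pairs than global attention, and the gap widens as the scene grows while group sizes stay bounded.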