Pixel-Level Reasoning Segmentation via Multi-turn Conversations

Paper · arXiv 2502.09447 · Published February 13, 2025
Multimodal Models

Existing visual perception systems focus on region-level segmentation in single-turn dialogues and rely on complex, explicit query instructions. Such systems cannot reason at the pixel level or comprehend user intent that changes over the course of an interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, which tracks evolving user intent across turns for fine-grained segmentation. To establish a benchmark for this task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework that integrates pixel-level segmentation with robust multi-turn conversation understanding and generates pixel-grounded explanations aligned with user intent. Together, the PRIST dataset and the MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines on both segmentation and LLM-based reasoning metrics.

Introduction. Existing general multimodal large language models (MLLMs) (Bai et al., 2023; Zhu et al., 2023; Liu et al., 2024b) exhibit exceptional visual perception, enabling both image segmentation and textual reasoning, but they primarily rely on explicit human instructions for region-level grounding. Although some segmentation-specific works have explored grounded reasoning responses (Peng et al.,

Discussion / Conclusion. In this paper, we propose a novel task, Pixel-level Reasoning Segmentation, which focuses on fine-grained segmentation. To advance this task, we construct a pixel-level reasoning segmentation dataset, PRIST, consisting of 24k utterances and 8.3k pixel-level segmentation targets, generated through a carefully designed three-stage progressive automatic annotation pipeline. Additionally, we present MIRAS, a framework designed for this task that combines segmentation with multi-turn interaction, along with LLM-based metrics for evaluating reasoning quality. Comprehensive experiments on segmentation and reasoning demonstrate the effectiveness of the PRIST dataset and the superior performance of MIRAS, advancing research in pixel-level reasoning segmentation.
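
To make the pairing of multi-turn dialogue with a pixel-level target concrete, the following is a minimal sketch, not the PRIST schema or the MIRAS evaluation code. The class names (`Turn`, `ConversationSample`), their fields, and the example utterances are hypothetical. The excerpt does not name the segmentation metrics used, so intersection-over-union (IoU) between binary masks is assumed here purely as a common illustrative choice.

```python
from dataclasses import dataclass
from typing import List

# A binary mask as nested lists: 1 marks a target pixel, 0 marks background.
Mask = List[List[int]]


@dataclass
class Turn:
    """One utterance in a conversation (hypothetical structure)."""
    speaker: str      # e.g. "user" or "assistant"
    utterance: str


@dataclass
class ConversationSample:
    """A multi-turn conversation paired with one segmentation target (hypothetical)."""
    turns: List[Turn]
    target_mask: Mask


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


sample = ConversationSample(
    turns=[
        Turn("user", "I want something to drink."),
        Turn("assistant", "Would you like a hot or a cold drink?"),
        Turn("user", "Something hot, please."),
    ],
    target_mask=[[0, 1, 1],
                 [0, 1, 1]],
)

# A made-up prediction that misses one target pixel.
predicted_mask = [[0, 0, 1],
                  [0, 1, 1]]

iou = mask_iou(predicted_mask, sample.target_mask)
print(f"IoU = {iou:.2f}")  # 3 shared pixels out of 4 in the union
```

In this sketch the target only becomes identifiable after the final user turn, which is the kind of evolving intent the task description refers to; the mask comparison itself is independent of how the prediction was produced.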