Unifying 3D Vision-Language Understanding via Promptable Queries

63 citations · #193 of 2387 papers in ECCV 2024

Abstract

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying different 3D scene representations (i.e., voxels, point clouds, multi-view images) into a common 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance, setting new records on most benchmarks. In particular, PQ3D boosts the state of the art on ScanNet200 by 1.8% (AP), ScanRefer by 5.4% (acc@0.5), Multi3DRef by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with whatever 3D representations are available, e.g., solely relying on voxels.
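To make innovation (2) more concrete, below is a minimal, hypothetical PyTorch sketch of a promptable query decoder in the spirit of the abstract: learnable queries cross-attend to segment-level scene features (assumed to be fused from voxels, point clouds, and multi-view images in a shared 3D coordinate space) together with an encoded prompt, and shared output heads produce instance masks, grounding scores, and text logits. All class, argument, and head names are illustrative assumptions, not the released PQ3D implementation.

```python
# Illustrative sketch only (not the official PQ3D code).
import torch
import torch.nn as nn


class PromptableQueryDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_heads=8, num_layers=4, vocab_size=32000):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learnable instance queries
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_layers)]
        )
        # Universal output heads shared across tasks (assumed design).
        self.mask_head = nn.Linear(dim, dim)          # dot product with segment features -> masks
        self.score_head = nn.Linear(dim, 1)           # grounding / relevance score per query
        self.text_head = nn.Linear(dim, vocab_size)   # token logits for captioning / QA

    def forward(self, scene_feats, prompt_feats):
        """scene_feats: (B, S, dim) segment-level features aligned in a common 3D space.
        prompt_feats: (B, T, dim) encoded task prompt, e.g., a referring expression."""
        B = scene_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        context = torch.cat([scene_feats, prompt_feats], dim=1)
        for ca, sa, ff in zip(self.cross_attn, self.self_attn, self.ffn):
            q = q + ca(q, context, context)[0]  # retrieve prompt-guided scene information
            q = q + sa(q, q, q)[0]              # exchange information among queries
            q = q + ff(q)
        masks = torch.einsum("bqd,bsd->bqs", self.mask_head(q), scene_feats)
        scores = self.score_head(q).squeeze(-1)
        token_logits = self.text_head(q)
        return masks, scores, token_logits
```

Under these assumptions, the same set of decoded queries can serve instance segmentation (via the mask head), visual grounding (via the score head), and captioning or question answering (via the text head), which is what enables multi-task training with a single model.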

Citation History

Jan 25, 2026: 64
Feb 13, 2026: 63