DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

48 citations

arXiv:2403.12488

Citation rank: #268 of 2387 papers in ECCV 2024

Top Authors

Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Philip Torr, Jian Wu

Topics

multimodal large language models, zero-shot object detection, prompting paradigm, detection prompting toolkit, chain-of-thought reasoning, open-vocabulary detection, referring expression comprehension, visual grounding

Abstract

We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measurement standards (e.g., overlaying rulers and compasses), and infer from the contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan for progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. GPT-4V with our DetToolChain improves on state-of-the-art object detectors by +21.5% AP50 on the MS COCO Novel class set for open-vocabulary detection, +24.23% Acc on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting.
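As a rough illustration of the "zooming in" idea and of what AP50 counts as a hit, the sketch below maps a box read off an enlarged crop back into full-image pixels and scores it with intersection over union (IoU). It is not the authors' implementation: the names `zoom_box_to_image`, `iou`, and `Box`, and the `(x_min, y_min, x_max, y_max)` pixel convention, are assumptions chosen for this example. The only outside fact it relies on is that AP50 treats a detection as correct when its IoU with the ground truth is at least 0.5.

```python
from typing import Tuple

# Hypothetical box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def zoom_box_to_image(box_in_crop: Box, crop: Box, zoom: float) -> Box:
    """Map a box predicted on a zoomed-in crop back to full-image pixels.

    `crop` is the region of the full image that was cut out, and `zoom` is the
    factor by which that region was enlarged before being shown to the model.
    """
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    crop_x0, crop_y0, _, _ = crop
    x0, y0, x1, y1 = box_in_crop
    # Undo the enlargement, then shift by the crop's top-left corner.
    return (
        crop_x0 + x0 / zoom,
        crop_y0 + y0 / zoom,
        crop_x0 + x1 / zoom,
        crop_y0 + y1 / zoom,
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    crop_region = (100.0, 50.0, 300.0, 250.0)   # region cut from the full image
    predicted = (40.0, 60.0, 240.0, 300.0)      # box read off the 2x-enlarged crop
    full_image_box = zoom_box_to_image(predicted, crop_region, zoom=2.0)
    ground_truth = (120.0, 80.0, 220.0, 200.0)  # made-up reference box
    hit_at_50 = iou(full_image_box, ground_truth) >= 0.5
    print(full_image_box, hit_at_50)
```

A refinement loop in the spirit of the abstract could repeat this step on progressively tighter crops, but how DetToolChain actually schedules its tools is described only in the paper itself.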

Citation History

[Chart: citation count, Jan 25, 2026 to Feb 13, 2026; labeled value 48 (+1)]