ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

77 citations · ranked #146 of 2,387 papers in ECCV 2024

Abstract

Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of the noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, implementing self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our findings.
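
To make the three modifications concrete, the sketch below is a minimal, hedged illustration rather than the authors' released code. It assumes CLIP's visual encoder is a PyTorch ViT whose final block exposes a standard torch.nn.MultiheadAttention module (attribute names such as `in_proj_weight` and `out_proj` follow PyTorch; the function name `clearclip_final_block` is hypothetical), and it uses query-query attention as one instance of self-self attention.

```python
import torch
import torch.nn.functional as F


def clearclip_final_block(x, attn):
    """Sketch of a ClearCLIP-style final visual layer.

    x    : (B, N, C) patch tokens entering CLIP's last transformer block.
    attn : that block's torch.nn.MultiheadAttention module (pretrained).

    Three changes relative to a standard block:
      1. the residual connection is removed (x is never added back),
      2. query-key attention is replaced by self-self (query-query) attention,
      3. the feed-forward network is skipped entirely.
    """
    B, N, C = x.shape
    num_heads = attn.num_heads
    head_dim = C // num_heads
    scale = head_dim ** -0.5

    # Reuse the block's pretrained input projection to form q, k, v.
    qkv = F.linear(x, attn.in_proj_weight, attn.in_proj_bias)
    q, _, v = qkv.chunk(3, dim=-1)
    q = q.reshape(B, N, num_heads, head_dim).transpose(1, 2)
    v = v.reshape(B, N, num_heads, head_dim).transpose(1, 2)

    # Self-self attention: queries attend to queries (q @ q^T) instead of keys.
    attn_weights = (q @ q.transpose(-2, -1)) * scale
    attn_weights = attn_weights.softmax(dim=-1)

    out = (attn_weights @ v).transpose(1, 2).reshape(B, N, C)
    out = attn.out_proj(out)

    # No residual addition and no feed-forward network: the attention output
    # alone serves as the dense feature map for open-vocabulary segmentation.
    return out
```

Under this reading, the per-patch outputs are then compared against CLIP text embeddings of the class names (e.g., via cosine similarity) to produce the segmentation map, with the residual and FFN branches removed precisely because they carry the global, contrastively trained signal that the paper identifies as the source of noise.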

Citation History

Jan 25, 2026: 68 citations
Feb 13, 2026: 77 citations (+9)