MoDE: CLIP Data Experts via Clustering

25 citations

arXiv:2404.16030 Project

#1074 of 2716 papers in CVPR 2024

Top Authors

Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-tau Yih, Hu Xu

Topics

contrastive language-image pretraining, data clustering, mixture of experts, noisy web data, zero-shot image classification, ensemble learning, task metadata correlation

Abstract

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, which makes it less sensitive to false-negative noise from other clusters. At inference time, we ensemble the experts' outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models from OpenAI CLIP and OpenCLIP on zero-shot image classification at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
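
The inference-time ensembling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine-similarity-plus-softmax weighting, the temperature value, and the array layouts are all assumptions made for illustration. The sketch assumes that task metadata (for example, class names) has been embedded into the same space as the fine-grained cluster centers, and that each fine-grained center is assigned to exactly one coarse-grained data expert.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def softmax(x, axis=-1, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    z = x / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def expert_weights(metadata_emb, fine_centers, fine_to_expert, num_experts,
                   temperature=0.1):
    """Hypothetical routing: one weight per (class, expert) pair.

    metadata_emb:   (C, D) embeddings of task metadata, e.g. class names
    fine_centers:   (M, D) fine-grained cluster centers
    fine_to_expert: (M,)   index of the data expert owning each fine center
    Returns a (C, K) array whose rows sum to 1.
    """
    sim = l2_normalize(metadata_emb) @ l2_normalize(fine_centers).T  # (C, M)
    p_fine = softmax(sim, axis=1, temperature=temperature)           # (C, M)
    assign = np.eye(num_experts)[fine_to_expert]                     # (M, K)
    # Sum the probability mass of the fine centers belonging to each expert.
    return p_fine @ assign                                           # (C, K)


def ensemble_logits(expert_logits, weights):
    """Weighted sum of per-expert zero-shot logits.

    expert_logits: (K, N, C) logits from K experts for N images and C classes
    weights:       (C, K)    output of expert_weights
    Returns (N, C) ensembled logits.
    """
    return np.einsum("knc,ck->nc", expert_logits, weights)
```

With four experts, `expert_logits` would have shape `(4, N, C)`, and the predicted class for each image is the `argmax` over the last axis of the ensembled logits. Aggregating over fine-grained centers rather than one center per expert reflects the abstract's point that semantically tight clusters give a more precise correlation estimate while the number of trained experts stays small.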
