The Double-Ellipsoid Geometry of CLIP

14 citations · #406 of 3340 papers in ICML 2025

Abstract

Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications across a large variety of domains. We investigate the geometry of its embedding, which is still not well understood, and show that text and image embeddings reside on linearly separable ellipsoid shells that are not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. We introduce a new notion of conformity, which measures the average cosine similarity of an instance to every other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
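The conformity measure described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses random unit vectors with a shared offset as a stand-in for CLIP embeddings (mimicking a shell not centered at the origin), computes each instance's average cosine similarity to all others, and compares it with the cheap estimate via the modality mean vector.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for one modality's CLIP embeddings: unit vectors
# with a shared offset from the origin (illustrative data, not from the paper).
X = rng.normal(size=(1000, 64)) + 2.0
X /= np.linalg.norm(X, axis=1, keepdims=True)

def conformity(X):
    """Average cosine similarity of each instance to every other instance."""
    S = X @ X.T                              # pairwise cosine similarities (rows are unit norm)
    n = len(X)
    return (S.sum(axis=1) - 1.0) / (n - 1)   # exclude self-similarity

def conformity_estimate(X):
    """Cheap estimate: cosine similarity to the modality mean vector."""
    mu = X.mean(axis=0)
    return X @ mu / np.linalg.norm(mu)

c = conformity(X)
c_hat = conformity_estimate(X)
r = np.corrcoef(c, c_hat)[0, 1]  # the two measures agree up to an affine rescaling
```

The agreement is exact up to an affine map: since the mean is the average of all instances, `x_i @ mu` equals `((n-1) * c_i + 1) / n`, so the estimate is a monotone rescaling of the true conformity.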

Citation History

Jan 28, 2026: 10
Feb 13, 2026: 14