Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation

Abstract

Contrastive language–image pretraining models such as CLIP have demonstrated remarkable performance in various text-image alignment tasks. However, CLIP's inherent 77-token input limit and its reliance on predominantly short-text training data restrict its ability to handle long-text tasks effectively. To overcome these constraints, we propose LongD-CLIP, a dual-teacher distillation framework designed to enhance long-text representation while mitigating knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into a student model, while the original CLIP serves as a secondary teacher that helps the student retain its foundational knowledge. Extensive experiments show that LongD-CLIP significantly outperforms existing models across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in the image-to-text retrieval task on the ShareGPT4V test set, LongD-CLIP exceeds Long-CLIP's performance by 2.5%, achieving an accuracy of 98.3%. Similarly, on the Urban1k dataset, it records a 9.2% improvement, reaching 91.9%, underscoring its robust generalization capabilities. Additionally, the text encoder of LongD-CLIP exhibits reduced latent-space drift and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
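To make the dual-teacher idea concrete, below is a minimal sketch of how such an objective could be assembled in PyTorch. The loss composition, the use of cosine distance for distillation, and all names and weighting coefficients (long_teacher_emb, orig_clip_emb, lambda_long, lambda_retain) are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a dual-teacher distillation objective for a CLIP-style text encoder.
# Assumptions: embeddings are precomputed per batch; the long-text teacher and the
# frozen original CLIP are the two teachers; weights and losses are illustrative.
import torch
import torch.nn.functional as F


def dual_teacher_distillation_loss(
    student_text_emb: torch.Tensor,   # (B, D) student text embeddings
    long_teacher_emb: torch.Tensor,   # (B, D) teacher fine-tuned on long-text data
    orig_clip_emb: torch.Tensor,      # (B, D) frozen original CLIP (secondary teacher)
    image_emb: torch.Tensor,          # (B, D) image embeddings
    logit_scale: torch.Tensor,        # learnable temperature, as in CLIP
    lambda_long: float = 1.0,         # weight for long-text distillation (assumed)
    lambda_retain: float = 0.5,       # weight for knowledge retention (assumed)
) -> torch.Tensor:
    # Normalize all embeddings to the unit hypersphere, as CLIP does.
    s = F.normalize(student_text_emb, dim=-1)
    t_long = F.normalize(long_teacher_emb, dim=-1)
    t_orig = F.normalize(orig_clip_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)

    # Standard symmetric contrastive loss between student text and image embeddings.
    logits = logit_scale * s @ v.t()
    labels = torch.arange(s.size(0), device=s.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))

    # Distill long-text representation knowledge from the fine-tuned teacher
    # (mean cosine distance between student and teacher embeddings).
    distill_long = (1.0 - (s * t_long).sum(dim=-1)).mean()

    # Anchor the student to the original CLIP to limit knowledge forgetting.
    distill_retain = (1.0 - (s * t_orig).sum(dim=-1)).mean()

    return contrastive + lambda_long * distill_long + lambda_retain * distill_retain
```

In this sketch the contrastive term keeps the student aligned with images, the first distillation term transfers long-text knowledge, and the second acts as a regularizer toward the original CLIP embedding space, which is one plausible way to reduce the latent-space drift the abstract describes.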
