"image-text retrieval" Papers
21 papers found
Conference
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg, Naman Deep Singh, Matthias Hein
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin et al.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang et al.
Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment
Yang Liu, Mengyuan Liu, Shudong Huang et al.
Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation
xin zhang, Ziruo Zhang, JIAWEI DU et al.
Compositional Entailment Learning for Hyperbolic Vision-Language Models
Avik Pal, Max van Spengler, Guido D'Amely di Melendugno et al.
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
Yuxuan Wang, Yijun Liu, Fei Yu et al.
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil, Raiymbek Akshulakov, YASSER ABDELAZIZ DAHOU DJILALI et al.
HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
Han Liu, Jiaqi Li, Zhi Xu et al.
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung, Seungwon Lim, Sangkyu Lee et al.
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei, Tao Chen, Yujia Wang et al.
Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency
Kai Gan, Bo Ye, Min-Ling Zhang et al.
VladVA: Discriminative Fine-tuning of LVLMs
Yassine Ouali, Adrian Bulat, ALEXANDROS XENOS et al.
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi, Chaofan Tao, Anyi Rao et al.
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen, Longteng Guo, Jia Sun et al.
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Chenglin Yang, Siyuan Qiao, Yuan Cao et al.
Language-Image Pre-training with Long Captions
Kecheng Zheng, Yifei Zhang, Wei Wu et al.
Multi-Label Cluster Discrimination for Visual Representation Learning
Xiang An, Kaicheng Yang, Xiangzi Dai et al.
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin, Xinyue Chen, Deepak Pathak et al.
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian liu et al.
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Haoyu Lu, Yuqi Huo, Guoxing Yang et al.