"image-text retrieval" Papers

21 papers found

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

Amit Peleg, Naman Deep Singh, Matthias Hein

NeurIPS 2025 · arXiv:2505.24424 · 2 citations

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Haicheng Wang, Chen Ju, Weixiong Lin et al.

CVPR 2025 · arXiv:2412.00440 · 10 citations

AmorLIP: Efficient Language-Image Pretraining via Amortization

Haotian Sun, Yitong Li, Yuchen Zhuang et al.

NeurIPS 2025 · arXiv:2505.18983 · 2 citations

Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment

Yang Liu, Mengyuan Liu, Shudong Huang et al.

AAAI 2025 · arXiv:2503.06974 · 6 citations

Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation

Xin Zhang, Ziruo Zhang, Jiawei Du et al.

NeurIPS 2025 · arXiv:2505.14705 · 3 citations

Compositional Entailment Learning for Hyperbolic Vision-Language Models

Avik Pal, Max van Spengler, Guido D'Amely di Melendugno et al.

ICLR 2025 · arXiv:2410.06912 · 37 citations

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Yuxuan Wang, Yijun Liu, Fei Yu et al.

AAAI 2025 · arXiv:2407.01081 · 7 citations

Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali et al.

CVPR 2025 · arXiv:2409.19425 · 2 citations

HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models

Han Liu, Jiaqi Li, Zhi Xu et al.

NeurIPS 2025

MASS: Overcoming Language Bias in Image-Text Matching

Jiwan Chung, Seungwon Lim, Sangkyu Lee et al.

AAAI 2025 · arXiv:2501.11469

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

Gensheng Pei, Tao Chen, Yujia Wang et al.

CVPR 2025 · arXiv:2503.17080 · 5 citations

Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency

Kai Gan, Bo Ye, Min-Ling Zhang et al.

ICLR 2025 · 3 citations

VladVA: Discriminative Fine-tuning of LVLMs

Yassine Ouali, Adrian Bulat, Alexandros Xenos et al.

CVPR 2025 · arXiv:2412.04378 · 11 citations

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Dachuan Shi, Chaofan Tao, Anyi Rao et al.

ICML 2024 · arXiv:2305.17455 · 43 citations

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Junyi Chen, Longteng Guo, Jia Sun et al.

AAAI 2024 · arXiv:2308.11971 · 20 citations

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

Chenglin Yang, Siyuan Qiao, Yuan Cao et al.

ECCV 2024 · arXiv:2311.17072 · 3 citations

Language-Image Pre-training with Long Captions

Kecheng Zheng, Yifei Zhang, Wei Wu et al.

ECCV 2024 · arXiv:2403.17007 · 65 citations

Multi-Label Cluster Discrimination for Visual Representation Learning

Xiang An, Kaicheng Yang, Xiangzi Dai et al.

ECCV 2024 · arXiv:2407.17331 · 14 citations

Revisiting the Role of Language Priors in Vision-Language Models

Zhiqiu Lin, Xinyue Chen, Deepak Pathak et al.

ICML 2024 · arXiv:2306.01879 · 39 citations

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma, Furong Xu, Jian Liu et al.

ICML 2024 · arXiv:2401.02137 · 7 citations

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Haoyu Lu, Yuqi Huo, Guoxing Yang et al.

ICLR 2024 · arXiv:2302.06605 · 55 citations