"vision-language alignment" Papers

28 papers found

$\Delta \mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

Lin Zhu, Yifeng Yang, Xinbing Wang et al.

NeurIPS 2025

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

Yang Liu, Wentao Feng, Zhuoyao Liu et al.

ICCV 2025 | arXiv:2503.14953 | 1 citation

Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang, Qian Yang, Aishwarya Agrawal

CVPR 2025 (highlight) | arXiv:2412.04616 | 15 citations

Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment

Yang Liu, Mengyuan Liu, Shudong Huang et al.

AAAI 2025 | arXiv:2503.06974 | 6 citations

BATCLIP: Bimodal Online Test-Time Adaptation for CLIP

Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky et al.

ICCV 2025 | arXiv:2412.02837 | 3 citations

CompCap: Improving Multimodal Large Language Models with Composite Captions

Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.

ICCV 2025 | arXiv:2412.05243 | 6 citations

Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

Jingjing Jiang, Chao Ma, Xurui Song et al.

ICCV 2025 (highlight) | arXiv:2507.07424 | 7 citations

Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Shicheng Xu, Liang Pang, Yunchang Zhu et al.

ICLR 2025 | arXiv:2410.12662 | 14 citations

DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning

Kun Zhang, Jingyu Li, Zhe Li et al.

CVPR 2025 | 2 citations

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Dahyun Kang, Piotr Bojanowski, Huy V. Vo et al.

CVPR 2025 | arXiv:2412.16334 | 46 citations

DS-VLM: Diffusion Supervision Vision Language Model

Zhen Sun, Yunhang Shen, Jie Li et al.

ICML 2025 | 1 citation

ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

Duy M. H. Nguyen, Nghiem Diep, Trung Nguyen et al.

NeurIPS 2025 | arXiv:2410.02615 | 5 citations

Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Bingchao Wang, Zhiwei Ning, Jianyu Ding et al.

ICCV 2025 | arXiv:2507.10095 | 7 citations

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Jiacheng Ruan, Wenzhen Yuan, Zehao Lin et al.

AAAI 2025 | arXiv:2409.16084 | 11 citations

MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Liang Yin, Xudong Xie, Zhang Li et al.

NeurIPS 2025 | arXiv:2506.10609

ParGo: Bridging Vision-Language with Partial and Global Views

An-Lan Wang, Bin Shan, Wei Shi et al.

AAAI 2025 | arXiv:2408.12928 | 25 citations

RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability

Jonggwon Park, Byungmu Yoon, Soobum Kim et al.

NeurIPS 2025 | arXiv:2504.07416 | 1 citation

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

Gensheng Pei, Tao Chen, Yujia Wang et al.

CVPR 2025 | arXiv:2503.17080 | 5 citations

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

Shaoan Xie, Lingjing Kong, Yujia Zheng et al.

CVPR 2025 (highlight) | arXiv:2507.22264 | 4 citations

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Shufan Shen, Junshu Sun, Qingming Huang et al.

NeurIPS 2025 | arXiv:2510.21323 | 1 citation

APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension

Yaxin Luo, Jiayi Ji, Xiaofu Chen et al.

ECCV 2024

CLIM: Contrastive Language-Image Mosaic for Region Representation

Size Wu, Wenwei Zhang, Lumin Xu et al.

AAAI 2024 | arXiv:2312.11376 | 25 citations

LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

Pengwei Yin, Jingjing Wang, Guanzhong Zeng et al.

ECCV 2024 | arXiv:2411.08606 | 9 citations

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

Dilxat Muhtar, Zhenshi Li, Feng Gu et al.

ECCV 2024 | arXiv:2402.02544 | 133 citations

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Penglei Sun, Yaoxian Song, Xinglin Pan et al.

ECCV 2024 | arXiv:2407.02846 | 2 citations

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

Meng Chu, Zhedong Zheng, Wei Ji et al.

ECCV 2024 | arXiv:2311.12751 | 26 citations

Weakly Supervised Open-Vocabulary Object Detection

Jianghang Lin, Yunhang Shen, Bingquan Wang et al.

AAAI 2024 | arXiv:2312.12437 | 17 citations

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao et al.

CVPR 2024 | arXiv:2311.17048 | 21 citations