"vision-language alignment" Papers
28 papers found
$\Delta \mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization
Lin Zhu, Yifeng Yang, Xinbing Wang et al.
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
Yang Liu, Wentao Feng, Zhuoyao Liu et al.
Assessing and Learning Alignment of Unimodal Vision and Language Models
Le Zhang, Qian Yang, Aishwarya Agrawal
Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment
Yang Liu, Mengyuan Liu, Shudong Huang et al.
BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky et al.
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.
Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
Jingjing Jiang, Chao Ma, Xurui Song et al.
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models
Shicheng Xu, Liang Pang, Yunchang Zhu et al.
DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning
Kun Zhang, Jingyu Li, Zhe Li et al.
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Dahyun Kang, Piotr Bojanowski, Huy V. Vo et al.
DS-VLM: Diffusion Supervision Vision Language Model
Zhen Sun, Yunhang Shen, Jie Li et al.
ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models
Duy M. H. Nguyen, Nghiem Diep, Trung Nguyen et al.
Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
Bingchao Wang, Zhiwei Ning, Jianyu Ding et al.
MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios
Jiacheng Ruan, Wenzhen Yuan, Zehao Lin et al.
MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin, Xudong Xie, Zhang Li et al.
ParGo: Bridging Vision-Language with Partial and Global Views
An-Lan Wang, Bin Shan, Wei Shi et al.
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability
Jonggwon Park, Byungmu Yoon, Soobum Kim et al.
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei, Tao Chen, Yujia Wang et al.
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Shaoan Xie, Lingjing Kong, Yujia Zheng et al.
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set
Shufan Shen, Junshu Sun, Qingming Huang et al.
APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension
Yaxin Luo, Jiayi Ji, Xiaofu Chen et al.
CLIM: Contrastive Language-Image Mosaic for Region Representation
Size Wu, Wenwei Zhang, Lumin Xu et al.
LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation
Pengwei Yin, Jingjing Wang, Guanzhong Zeng et al.
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
Dilxat Muhtar, Zhenshi Li, Feng Gu et al.
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
Penglei Sun, Yaoxian Song, Xinglin Pan et al.
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Meng Chu, Zhedong Zheng, Wei Ji et al.
Weakly Supervised Open-Vocabulary Object Detection
Jianghang Lin, Yunhang Shen, Bingquan Wang et al.
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han, Fangrui Zhu, Qianru Lao et al.