Poster "cross-modal alignment" Papers

42 papers found

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Junming Liu, Siyuan Meng, Yanting Gao et al.

ICCV 2025, arXiv:2503.12972
19 citations

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Yan Li, Yifei Xing, Xiangyuan Lan et al.

CVPR 2025, arXiv:2412.00833
17 citations

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.

NeurIPS 2025, arXiv:2502.01341
1 citation

Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process

Tsai Hor Chan, Feng Wu, Yihang Chen et al.

NeurIPS 2025, arXiv:2510.20736

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

Zhenxing Zhang, Yaxiong Wang, Lechao Cheng et al.

CVPR 2025, arXiv:2412.12718
11 citations

Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation

Xin Zhang, Ziruo Zhang, Jiawei Du et al.

NeurIPS 2025, arXiv:2505.14705
3 citations

Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

Tianjiao Jiang, Zhen Zhang, Yuhang Liu et al.

ICCV 2025, arXiv:2508.03102
1 citation

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Edson Araujo, Andrew Rouditchenko, Yuan Gong et al.

CVPR 2025, arXiv:2505.01237
2 citations

CF-VLM: CounterFactual Vision-Language Fine-tuning

Jusheng Zhang, Kaitong Cai, Yijia Fan et al.

NeurIPS 2025

CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs

Jinlan Fu, Shenzhen Huangfu, Hao Fei et al.

ICLR 2025, arXiv:2501.16629
21 citations

DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding

Henry Zheng, Hao Shi, Qihang Peng et al.

ICLR 2025, arXiv:2505.04965
8 citations

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

Jeonghyeon Kim, Sangheum Hwang

CVPR 2025, arXiv:2503.18817
4 citations

Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

Xiangru Zhu, Penglei Sun, Yaoxian Song et al.

ICLR 2025, arXiv:2410.10291
2 citations

Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning

Yiyang Chen, Shanshan Zhao, Lunhao Duan et al.

ICCV 2025, arXiv:2507.09102

Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Hao Zheng, Shunzhi Yang, Zhuoxin He et al.

ICCV 2025, arXiv:2507.14976
6 citations

It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

Dominik Schnaus, Nikita Araslanov, Daniel Cremers

CVPR 2025, arXiv:2503.24129
7 citations

Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval

Yue Wu, Zhaobo Qi, Yiling Wu et al.

ICLR 2025
7 citations

Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification

Yongxiang Li, Yanglin Feng, Yuan Sun et al.

NeurIPS 2025

Mitigate the Gap: Improving Cross-Modal Alignment in CLIP

Sedigheh Eslami, Gerard de Melo

ICLR 2025
15 citations

Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma, Yiran He, Bin Sun et al.

ICCV 2025, arXiv:2506.21017
2 citations

Preacher: Paper-to-Video Agentic System

Jingwei Liu, Ling Yang, Hao Luo et al.

ICCV 2025, arXiv:2508.09632
2 citations

ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning

Quanxing Zha, Xin Liu, Shu-Juan Peng et al.

CVPR 2025, arXiv:2502.19962
1 citation

Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding

Yanglin Feng, Hongyuan Zhu, Dezhong Peng et al.

NeurIPS 2025

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification

Tuo Xiang, Xuemiao Xu, Bangzhen Liu et al.

ICCV 2025, arXiv:2509.14958
1 citation

Seg4Diff: Unveiling Open-Vocabulary Semantic Segmentation in Text-to-Image Diffusion Transformers

Chaehyun Kim, Heeseong Shin, Eunbeen Hong et al.

NeurIPS 2025
6 citations

Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency

Kai Gan, Bo Ye, Min-Ling Zhang et al.

ICLR 2025
3 citations

SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval

Jiahang Zhang, Lilang Lin, Shuai Yang et al.

NeurIPS 2025

TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

Ziyang Luo, Nian Liu, Xuguang Yang et al.

ICCV 2025, arXiv:2506.11436
3 citations

The Indra Representation Hypothesis

Jianglin Lu, Hailing Wang, Kuo Yang et al.

NeurIPS 2025

When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

Youqi Wu, Jingwei Zhang, Farzan Farnia

NeurIPS 2025, arXiv:2506.08645
2 citations

Amend to Alignment: Decoupled Prompt Tuning for Mitigating Spurious Correlation in Vision-Language Models

Jie Zhang, Xiaosong Ma, Song Guo et al.

ICML 2024

Audio-visual Generalized Zero-shot Learning the Easy Way

Shentong Mo, Pedro Morgado

ECCV 2024, arXiv:2407.13095
8 citations

Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

Weiwei Cao, Jianpeng Zhang, Yingda Xia et al.

CVPR 2024, arXiv:2404.04936
15 citations

CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

Taewoo Kim, Hoonhee Cho, Kuk-Jin Yoon

ECCV 2024, arXiv:2408.14930
12 citations

Integration of Global and Local Representations for Fine-grained Cross-modal Alignment

Seungwan Jin, Hoyoung Choi, Taehyung Noh et al.

ECCV 2024
1 citation

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Yuxiao Chen, Kai Li, Wentao Bao et al.

ECCV 2024, arXiv:2409.16145
7 citations

Position: The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang et al.

ICML 2024

Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection

Yuanpeng Tu, Boshen Zhang, Liang Liu et al.

ECCV 2024, arXiv:2401.03145
24 citations

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Xinyao Li, Yuke Li, Zhekai Du et al.

CVPR 2024, arXiv:2403.06946
19 citations

SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

Chull Hwan Song, Taebaek Hwang, Jooyoung Yoon et al.

CVPR 2024, arXiv:2404.01156
11 citations

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Yuanhong Chen, Yuyuan Liu, Hu Wang et al.

CVPR 2024, arXiv:2304.02970
34 citations

Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

Jinhao Li, Haopeng Li, Sarah Erfani et al.

ICML 2024, arXiv:2406.02915
26 citations