"multimodal alignment" Papers

30 papers found

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye, Hang Ding, Siyuan Chen et al.

NeurIPS 2025 · arXiv:2511.08399

AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward

Haonan Han, Xiangzuo Wu, Huan Liao et al.

CVPR 2025 · arXiv:2411.18654 · 5 citations

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Eunseop Yoon, Hee Suk Yoon, Mark Hasegawa-Johnson et al.

ICLR 2025 · arXiv:2507.04976 · 4 citations

CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

Xiao Wang, Fuling Wang, Yuehang Li et al.

CVPR 2025 · arXiv:2410.00379 · 19 citations

Emotional Face-to-Speech

Jiaxin Ye, Boyuan Cao, Hongming Shan

ICML 2025 · arXiv:2502.01046 · 4 citations

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

CVPR 2025 · arXiv:2409.18042 · 48 citations

FLOPS: Forward Learning with OPtimal Sampling

Tao Ren, Zishi Zhang, Jinyang Jiang et al.

ICLR 2025 · arXiv:2410.05966 · 2 citations

GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Saarthak Kapse, Pushpak Pati, Srikar Yellapragada et al.

ICCV 2025 (highlight) · arXiv:2504.01009 · 4 citations

Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali et al.

CVPR 2025 · arXiv:2409.19425 · 2 citations

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.

AAAI 2025 · arXiv:2408.08640 · 60 citations

MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

Akio Hayakawa, Masato Ishii, Takashi Shibuya et al.

ICLR 2025 · arXiv:2405.17842 · 18 citations

Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma, Yiran He, Bin Sun et al.

ICCV 2025 · arXiv:2506.21017 · 2 citations

OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis

Run Luo, Ting-En Lin, Haonan Zhang et al.

NeurIPS 2025

Self-Supervised Spatial Correspondence Across Modalities

Ayush Shrivastava, Andrew Owens

CVPR 2025 · arXiv:2506.03148 · 2 citations

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng et al.

NeurIPS 2025 · arXiv:2509.25033 · 4 citations

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Zeyi Sun, Ye Fang, Tong Wu et al.

CVPR 2024 · arXiv:2312.03818 · 170 citations

A Touch, Vision, and Language Dataset for Multimodal Alignment

Letian Fu, Gaurav Datta, Huang Huang et al.

ICML 2024 · arXiv:2402.13232 · 74 citations

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.

AAAI 2024 · arXiv:2401.07567 · 15 citations

Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Fengyu Yang, Chao Feng, Ziyang Chen et al.

CVPR 2024 · arXiv:2401.18084 · 112 citations

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Ruihuang Li, Zhengqiang Zhang, Chenhang He et al.

ECCV 2024 · arXiv:2407.09781 · 11 citations

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

ECCV 2024 · arXiv:2403.09377 · 4 citations

PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno et al.

ECCV 2024 · arXiv:2409.06535 · 7 citations

SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He et al.

CVPR 2024 · arXiv:2401.04394 · 3 citations

STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment

Jaewoo Lee, Jaehong Yoon, Wonjae Kim et al.

ICML 2024 (oral)

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma, Furong Xu, Jian Liu et al.

ICML 2024 · arXiv:2401.02137 · 7 citations

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

Chaoya Jiang, Wei Ye, Haiyang Xu et al.

AAAI 2024 · arXiv:2312.08846 · 6 citations

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

Chengen Lai, Shengli Song, Shiqi Meng et al.

AAAI 2024 · arXiv:2312.13594 · 10 citations

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Kun Su, Judith Li, Qingqing Huang et al.

AAAI 2024 · arXiv:2305.06594 · 24 citations

View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

ECCV 2024 · arXiv:2404.07984 · 31 citations

X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning

Artemis Panagopoulou, Le Xue, Ning Yu et al.

ECCV 2024 · 6 citations