"multimodal alignment" Papers
30 papers found
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward
Haonan Han, Xiangzuo Wu, Huan Liao et al.
Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
Eunseop Yoon, Hee Suk Yoon, Mark Hasegawa-Johnson et al.
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset
Xiao Wang, Fuling Wang, Yuehang Li et al.
Emotional Face-to-Speech
Jiaxin Ye, Boyuan Cao, Hongming Shan
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
FLOPS: Forward Learning with OPtimal Sampling
Tao Ren, Zishi Zhang, Jinyang Jiang et al.
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
Saarthak Kapse, Pushpak Pati, Srikar Yellapragada et al.
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali et al.
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
Akio Hayakawa, Masato Ishii, Takashi Shibuya et al.
Multimodal Prompt Alignment for Facial Expression Recognition
Fuyan Ma, Yiran He, Bin Sun et al.
OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis
Run Luo, Ting-En Lin, Haonan Zhang et al.
Self-Supervised Spatial Correspondence Across Modalities
Ayush Shrivastava, Andrew Owens
VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Wenhao Li, Qiangchang Wang, Xianjing Meng et al.
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Zeyi Sun, Ye Fang, Tong Wu et al.
A Touch, Vision, and Language Dataset for Multimodal Alignment
Letian Fu, Gaurav Datta, Huang Huang et al.
Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang, Chao Feng, Ziyang Chen et al.
Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
Ruihuang Li, Zhengqiang Zhang, Chenhang He et al.
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer et al.
SonicVisionLM: Playing Sound with Vision Language Models
Zhifeng Xie, Shengye Yu, Qile He et al.
STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
Jaewoo Lee, Jaehong Yoon, Wonjae Kim et al.
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian Liu et al.
TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang, Wei Ye, Haiyang Xu et al.
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA
Chengen Lai, Shengli Song, Shiqi Meng et al.
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Kun Su, Judith Li, Qingqing Huang et al.
View Selection for 3D Captioning via Diffusion Ranking
Tiange Luo, Justin Johnson, Honglak Lee
X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
Artemis Panagopoulou, Le Xue, Ning Yu et al.