"modality alignment" Papers

24 papers found

3D-MolT5: Leveraging Discrete Structural Information for Molecule-Text Modeling

Qizhi Pei, Rui Yan, Kaiyuan Gao et al.

ICLR 2025 · arXiv:2406.05797 · 6 citations

An Effective Levelling Paradigm for Unlabeled Scenarios

Fangming Cui, Zhou Yu, Di Yang et al.

NEURIPS 2025

BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Tianyu Guo, Hongyu Chen, Hao Liang et al.

NEURIPS 2025 · arXiv:2512.10403 · 3 citations

Gramian Multimodal Representation Learning and Alignment

Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo et al.

ICLR 2025 · arXiv:2412.11959 · 33 citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Chenxin Tao, Shiqian Su, Xizhou Zhu et al.

CVPR 2025 · arXiv:2412.16158 · 5 citations

Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval

Yue Wu, Zhaobo Qi, Yiling Wu et al.

ICLR 2025 · 7 citations

Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification

Yongxiang Li, Yanglin Feng, Yuan Sun et al.

NEURIPS 2025

MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

Jinfeng Xu, Zheyu Chen, Shuo Yang et al.

AAAI 2025 · arXiv:2402.19407 · 50 citations

Multi-modal Learning: A Look Back and the Road Ahead

Divyam Madaan, Sumit Chopra, Kyunghyun Cho

ICLR 2025

Multimodal Tabular Reasoning with Privileged Structured Information

Jun-Peng Jiang, Yu Xia, Hai-Long Sun et al.

NEURIPS 2025 · arXiv:2506.04088 · 9 citations

MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion

Fei Peng, Junqiang Wu, Yan Li et al.

ICCV 2025 · arXiv:2508.14440 · 2 citations

Object-Shot Enhanced Grounding Network for Egocentric Video

Yisen Feng, Haoyu Zhang, Meng Liu et al.

CVPR 2025 · arXiv:2505.04270 · 7 citations

One Filters All: A Generalist Filter For State Estimation

Shiqi Liu, Wenhan Cao, Chang Liu et al.

NEURIPS 2025 · arXiv:2509.20051 · 2 citations

Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai et al.

CVPR 2025 · arXiv:2504.12717 · 12 citations

ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams

Chris Dongjoo Kim, Jihwan Moon, Sangwoo Moon et al.

CVPR 2025 · arXiv:2504.14875 · 1 citation

scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration

Jianle Sun, Chaoqi Liang, Ran Wei et al.

NEURIPS 2025 (spotlight) · arXiv:2510.24987 · 2 citations

Unsupervised Audio-Visual Segmentation with Modality Alignment

Swapnil Bhosale, Haosen Yang, Diptesh Kanojia et al.

AAAI 2025 · arXiv:2403.14203 · 8 citations

Vocabulary-Guided Gait Recognition

Panjian Huang, Saihui Hou, Chunshui Cao et al.

NEURIPS 2025

CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Faegheh Sardari, Armin Mustafa, Philip JB Jackson et al.

ECCV 2024 · arXiv:2405.10690 · 11 citations

Conceptual Codebook Learning for Vision-Language Models

Yi Zhang, Ke Yu, Siqi Wu et al.

ECCV 2024 · arXiv:2407.02350 · 7 citations

Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Sravanti Addepalli, Ashish Asokan, Lakshay Sharma et al.

CVPR 2024 · arXiv:2310.08255 · 48 citations

Object-Oriented Anchoring and Modal Alignment in Multimodal Learning

Shibin Mei, Bingbing Ni, Hang Wang et al.

ECCV 2024 · 1 citation

Tabular Insights, Visual Impacts: Transferring Expertise from Tables to Images

Jun-Peng Jiang, Han-Jia Ye, Leye Wang et al.

ICML 2024 (spotlight)

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

Qianrui Zhou, Hua Xu, Hao Li et al.

AAAI 2024 · arXiv:2312.14667 · 35 citations