"multimodal video understanding" Papers
7 papers found
Conference
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo, Shuailei Ma, Shijie Ma et al.
ICLR 2025 | oral | arXiv:2504.02061
9 citations
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero et al.
NeurIPS 2025 | arXiv:2509.19245
1 citation
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
Siran Chen, Yuxiao Luo, Yue Ma et al.
AAAI 2025 | paper | arXiv:2501.04302
6 citations
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng et al.
ICLR 2025 | arXiv:2406.08407
36 citations
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Xiao Yu, Yan Fang, Yao Zhao et al.
NeurIPS 2025 | oral | arXiv:2505.23155
2 citations
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
Hao Wu, Huabin Liu, Yu Qiao et al.
CVPR 2024 | arXiv:2404.02755
20 citations
Exploiting Auxiliary Caption for Video Grounding
Hongxiang Li, Meng Cao, Xuxin Cheng et al.
AAAI 2024 | paper | arXiv:2301.05997
14 citations