Poster Papers: "multimodal llms"
9 papers found
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu, Jaehong Yoon, Mohit Bansal
ICLR 2025 · arXiv:2402.05889 · 17 citations
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.
ICCV 2025 · arXiv:2505.00681 · 10 citations
NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
Lanrui Wang, Mingyu Zheng, Hongyin Tang et al.
NeurIPS 2025 · arXiv:2504.06560 · 4 citations
Passing the Driving Knowledge Test
Maolin Wei, Wanzhou Liu, Eshed Ohn-Bar
ICCV 2025 · arXiv:2508.21824 · 2 citations
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Zhengfeng Lai, Vasileios Saveris, Chen Chen et al.
ICLR 2025 · arXiv:2410.02740 · 9 citations
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu, Hao Fei, Xiangtai Li et al.
ICLR 2025 · arXiv:2406.05127 · 58 citations
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
CVPR 2024 · arXiv:2401.06209 · 593 citations
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
Ling Yang, Zhaochen Yu, Chenlin Meng et al.
ICML 2024 · arXiv:2401.11708 · 200 citations
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
CVPR 2024 · arXiv:2312.14135 · 345 citations