"multimodal llms" Papers

11 papers found

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Shoubin Yu, Jaehong Yoon, Mohit Bansal

ICLR 2025 · arXiv:2402.05889
17 citations

MINERVA: Evaluating Complex Video Reasoning

Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.

ICCV 2025 · arXiv:2505.00681
10 citations

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang, Xingyu Fu, James Y. Huang et al.

ICLR 2025 (oral) · arXiv:2406.09411
120 citations

NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

Lanrui Wang, Mingyu Zheng, Hongyin Tang et al.

NeurIPS 2025 · arXiv:2504.06560
4 citations

Passing the Driving Knowledge Test

Maolin Wei, Wanzhou Liu, Eshed Ohn-Bar

ICCV 2025 · arXiv:2508.21824
2 citations

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Zhengfeng Lai, Vasileios Saveris, Chen Chen et al.

ICLR 2025 · arXiv:2410.02740
9 citations

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li et al.

ICLR 2025 · arXiv:2406.05127
58 citations

Auto-Encoding Morph-Tokens for Multimodal LLM

Kaihang Pan, Siliang Tang, Juncheng Li et al.

ICML 2024 (spotlight) · arXiv:2405.01926
32 citations

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024 · arXiv:2401.06209
593 citations

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Ling Yang, Zhaochen Yu, Chenlin Meng et al.

ICML 2024 · arXiv:2401.11708
200 citations

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024 · arXiv:2312.14135
345 citations