"multimodal large language models" Papers

310 papers found • Page 6 of 7

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Qi Wang, Yanrui Yu, Ye Yuan et al.

NEURIPS 2025 (Oral) • arXiv:2505.12434
34 citations

Video Summarization with Large Language Models

Min Jung Lee, Dayoung Gong, Minsu Cho

CVPR 2025 • arXiv:2504.11199
11 citations

Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Qi Li, Runpeng Yu, Xinchao Wang

NEURIPS 2025 (Oral) • arXiv:2506.03179
5 citations

Vision Function Layer in Multimodal LLMs

Cheng Shi, Yizhou Yu, Sibei Yang

NEURIPS 2025 • arXiv:2509.24791
4 citations

Visual Instruction Bottleneck Tuning

Changdae Oh, Jiatong Li, Shawn Im et al.

NEURIPS 2025 • arXiv:2505.13946
3 citations

VisualLens: Personalization through Task-Agnostic Visual History

Wang Bill Zhu, Deqing Fu, Kai Sun et al.

NEURIPS 2025 • arXiv:2411.16034

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu, Haojia Lin, Xiong Wang et al.

NEURIPS 2025 (Spotlight) • arXiv:2501.01957
138 citations

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

Nilay Yilmaz, Maitreya Patel, Lawrence Luo et al.

ICLR 2025 • arXiv:2503.00043
1 citation

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias et al.

ICCV 2025 • arXiv:2510.14672
2 citations

Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

Xiaoyu Yang, Jie Lu, En Yu

NEURIPS 2025 (Oral)
6 citations

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo et al.

NEURIPS 2025 (Oral) • arXiv:2505.18110
3 citations

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Seonghwan Kim, Junhee Cho et al.

NEURIPS 2025 (Spotlight) • arXiv:2505.15277
8 citations

What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph

Yutao Jiang, Qiong Wu, Wenhao Lin et al.

AAAI 2025 (Paper) • arXiv:2501.02268
20 citations

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang, Xinheng Lyu, Meidan Ding et al.

ICCV 2025 • arXiv:2412.02141
10 citations

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Jian Ma, Qirong Peng, Xu Guo et al.

ICCV 2025 • arXiv:2503.06134
5 citations

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.

CVPR 2025 (Highlight) • arXiv:2503.23771
26 citations

You Only Communicate Once: One-shot Federated Low-Rank Adaptation of MLLM

Binqian Xu, Haiyang Mei, Zechen Bai et al.

NEURIPS 2025

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Xudong Li, Mengdan Zhang, Peixian Chen et al.

NEURIPS 2025 • arXiv:2505.22396
2 citations

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

Xiangming Gu, Xiaosen Zheng, Tianyu Pang et al.

ICML 2024 • arXiv:2402.08567
103 citations

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu, Yifan Xu, Yi Li et al.

AAAI 2024 (Paper) • arXiv:2308.09936
192 citations

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye, Zitong Yu, Rui Shao et al.

ECCV 2024 • arXiv:2403.04640
50 citations

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Yixuan Wu, Yizhou Wang, Shixiang Tang et al.

ECCV 2024 • arXiv:2403.12488
48 citations

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Yichi Zhang, Yinpeng Dong, Siyuan Zhang et al.

CVPR 2024 (Highlight) • arXiv:2404.11207
18 citations

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Keen You, Haotian Zhang, Eldon Schoop et al.

ECCV 2024 • arXiv:2404.05719
157 citations

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Jie Yang, Xuesong Niu, Nan Jiang et al.

ECCV 2024 • arXiv:2407.12435
23 citations

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Zhikai Zhang, Yitang Li, Haofeng Huang et al.

ECCV 2024 • arXiv:2406.10740
8 citations

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Zhangyang Qi, Ye Fang, Zeyi Sun et al.

CVPR 2024 (Highlight) • arXiv:2312.02980
64 citations

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Chuofan Ma, Yi Jiang, Jiannan Wu et al.

ECCV 2024 • arXiv:2404.13013
107 citations

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024 • arXiv:2402.16846
76 citations

Grounding Language Models for Visual Entity Recognition

Zilin Xiao, Ming Gong, Paola Cascante-Bonilla et al.

ECCV 2024 • arXiv:2402.18695
13 citations

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han et al.

CVPR 2024 • arXiv:2312.10103
130 citations

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou et al.

ECCV 2024 • arXiv:2403.09792
101 citations

Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning

Wei Li, Hehe Fan, Yongkang Wong et al.

ICML 2024

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Ryota Tanaka, Taichi Iki, Kyosuke Nishida et al.

AAAI 2024 (Paper) • arXiv:2401.13313
36 citations

Interactive Continual Learning: Fast and Slow Thinking

Biqing Qi, Xinquan Chen, Junqi Gao et al.

CVPR 2024 • arXiv:2403.02628
36 citations

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

Chuwei Luo, Yufan Shen, Zhaoqing Zhu et al.

CVPR 2024 • arXiv:2404.05225
109 citations

Listen, Think, and Understand

Yuan Gong, Hongyin Luo, Alexander Liu et al.

ICLR 2024 • arXiv:2305.10790
224 citations

LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang

Yuqing Zhang, Hangqi Li, Shengyu Zhang et al.

ECCV 2024
6 citations

LLMGA: Multimodal Large Language Model based Generation Assistant

Bin Xia, Shiyin Wang, Yingfan Tao et al.

ECCV 2024 • arXiv:2311.16500
25 citations

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

Zhuo Huang, Chang Liu, Yinpeng Dong et al.

ICML 2024 • arXiv:2312.02546
23 citations

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Xiaoqi Li, Mingxu Zhang, Yiran Geng et al.

CVPR 2024 • arXiv:2312.16217
182 citations

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang et al.

ICML 2024 • arXiv:2402.04788
281 citations

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu et al.

ECCV 2024 • arXiv:2311.17600
199 citations

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Yiran Qin, Enshen Zhou, Qichang Liu et al.

CVPR 2024 • arXiv:2312.07472
77 citations

NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu, Hao Fei, Leigang Qu et al.

ICML 2024 • arXiv:2309.05519
726 citations

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian Liu et al.

CVPR 2024 • arXiv:2312.10032
149 citations

PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Junyi Li, Junfeng Wu, Weizhi Zhao et al.

ECCV 2024 • arXiv:2407.16696
13 citations

PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology

Yuxuan Sun, Chenglu Zhu, Sunyi Zheng et al.

AAAI 2024 (Paper) • arXiv:2305.15072
79 citations

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Agneet Chatterjee, Yiran Luo, Tejas Gokhale et al.

ECCV 2024 • arXiv:2408.02231
10 citations

RoboMP²: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

Qi Lv, Hao Li, Xiang Deng et al.

ICML 2024 • arXiv:2404.04929
4 citations