"multi-modal large language models" Papers

40 papers found

Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

Yuti Liu, Shice Liu, Junyuan Gao et al.

AAAI 2025 · arXiv:2412.11952

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Zhen Xing, Qi Dai, Zejia Weng et al.

ICCV 2025 · arXiv:2406.06465 · 24 citations

Aligning Effective Tokens with Video Anomaly in Large Language Models

Yingxian Chen, Jiahui Liu, Ruidi Fan et al.

ICCV 2025 · arXiv:2508.06350 · 1 citation

Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models

Hulingxiao He, Geng Li, Zijun Geng et al.

ICLR 2025 · arXiv:2501.15140 · 19 citations

A Token-level Text Image Foundation Model for Document Understanding

Tongkun Guan, Zining Wang, Pei Fu et al.

ICCV 2025 · arXiv:2503.02304 · 4 citations

BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models

Zenghui Yuan, Jiawen Shi, Pan Zhou et al.

CVPR 2025 · arXiv:2503.16023 · 10 citations

Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko, Ji Soo Lee, Minhyuk Choi et al.

ICCV 2025 (highlight) · arXiv:2507.23284 · 3 citations

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Xiaoyi Bao, Chen-Wei Xie, Hao Tang et al.

ICCV 2025 · arXiv:2507.15569 · 1 citation

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Shengyuan Liu, Boyun Zheng, Wenting Chen et al.

NeurIPS 2025 · arXiv:2505.23601 · 10 citations

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Lixiong Qin, Shilong Ou, Miaoxuan Zhang et al.

NeurIPS 2025 · arXiv:2501.01243 · 8 citations

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Zhipei Xu, Xuanyu Zhang, Runyi Li et al.

ICLR 2025 · arXiv:2410.02761 · 77 citations

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Lu Zhang, Jiazuo Yu, Haomiao Xiong et al.

NeurIPS 2025 · arXiv:2510.21311 · 1 citation

First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

Lai Wei, Yuting Li, Chen Wang et al.

NeurIPS 2025 · arXiv:2505.22453 · 10 citations

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Hang Du, Jiayang Zhang, Guoshun Nan et al.

ICCV 2025 · arXiv:2509.17040

Generative RLHF-V: Learning Principles from Multi-modal Human Preference

Jiayi Zhou, Jiaming Ji, Boyuan Chen et al.

NeurIPS 2025 · arXiv:2505.18531 · 7 citations

GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

Lang Lin, Xueyang Yu, Ziqi Pang et al.

CVPR 2025 · arXiv:2504.07962 · 22 citations

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Qirui Chen, Shangzhe Di, Weidi Xie

AAAI 2025 · arXiv:2408.14469 · 27 citations

HOComp: Interaction-Aware Human-Object Composition

Dong Liang, Jinyuan Jia, Yuhao Liu et al.

NeurIPS 2025 · arXiv:2507.16813

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.

CVPR 2025 · arXiv:2503.16036 · 15 citations

MP-GUI: Modality Perception with MLLMs for GUI Understanding

Ziwei Wang, Weizhi Chen, Leyang Yang et al.

CVPR 2025 · arXiv:2503.14021 · 11 citations

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu et al.

ICLR 2025 · arXiv:2408.04840 · 243 citations

Multi-step Visual Reasoning with Visual Tokens Scaling and Verification

Tianyi Bai, Zengjie Hu, Fupeng Sun et al.

NeurIPS 2025 · arXiv:2506.07235 · 14 citations

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Sung-Yeon Park, Can Cui, Yunsheng Ma et al.

ICCV 2025 · arXiv:2503.12772 · 13 citations

RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts

Xuming He, Zhiyuan You, Junchao Gong et al.

NeurIPS 2025 · arXiv:2508.12291 · 5 citations

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin et al.

NeurIPS 2025 (oral) · arXiv:2506.06218 · 4 citations

To Think or Not To Think: A Study of Thinking in Rule-Based Visual Reinforcement Fine-Tuning

Ming Li, Jike Zhong, Shitian Zhao et al.

NeurIPS 2025 (spotlight)

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence

Jie Feng, Shengyuan Wang, Tianhui Liu et al.

ICCV 2025 · arXiv:2506.23219 · 11 citations

VideoAds for Fast-Paced Video Understanding

Zheyuan Zhang, Wanying Dou, Linkai Peng et al.

ICCV 2025 · arXiv:2504.09282 · 3 citations

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025 · arXiv:2409.14485 · 155 citations

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

Zhengrong Yue, Shaobin Zhuang, Kunchang Li et al.

CVPR 2025 · arXiv:2503.12077 · 5 citations

CoReS: Orchestrating the Dance of Reasoning and Segmentation

Xiaoyi Bao, Siyang Sun, Shuailei Ma et al.

ECCV 2024 · arXiv:2404.05673 · 18 citations

Facial Affective Behavior Analysis with Instruction Tuning

Yifan Li, Anh Dao, Wentao Bao et al.

ECCV 2024 · arXiv:2404.05052 · 24 citations

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

Simindokht Jahangard, Zhixi Cai, Shiki Wen et al.

CVPR 2024 · arXiv:2404.04458 · 15 citations

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Didi Zhu, Zhongyi Sun, Zexi Li et al.

ICML 2024 · arXiv:2402.12048 · 48 citations

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye et al.

CVPR 2024 (highlight) · arXiv:2311.04257 · 614 citations

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He et al.

CVPR 2024 (highlight) · arXiv:2311.17005 · 902 citations

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang et al.

CVPR 2024 (highlight) · arXiv:2311.17911 · 385 citations

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Yi-Chia Chen, Wei-Hua Li, Cheng Sun et al.

ECCV 2024 · arXiv:2409.10542 · 59 citations

SegPoint: Segment Any Point Cloud via Large Language Model

Shuting He, Henghui Ding, Xudong Jiang et al.

ECCV 2024 · arXiv:2407.13761 · 37 citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Dongyang Liu, Renrui Zhang, Longtian Qiu et al.

ICML 2024 · arXiv:2402.05935 · 141 citations