"multimodal models" Papers

23 papers found

CASP: Compression of Large Multimodal Models Based on Attention Sparsity

Mohsen Gholami, Mohammad Akbari, Kevin Cannons et al.

CVPR 2025 (highlight) · arXiv:2503.05936 · 4 citations

Context-aware Dynamic Pruning for Speech Foundation Models

Masao Someki, Yifan Peng, Siddhant Arora et al.

ICLR 2025 · 7 citations

Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)

Subba Reddy Oota, Akshett Rai Jindal, Ishani Mondal et al.

ICLR 2025 · arXiv:2505.20029 · 5 citations

Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision

Weicai Yan, Wang Lin, Zirun Guo et al.

ICLR 2025 · arXiv:2504.21423 · 7 citations

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.

CVPR 2025 · arXiv:2503.02175 · 57 citations

DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection

Li Li, Huixian Gong, Hao Dong et al.

CVPR 2025 (highlight) · arXiv:2411.08227 · 14 citations

ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan, Volodymyr Mnih, Aleksandra Faust et al.

ICLR 2025 · arXiv:2410.08368 · 23 citations

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Hanyu Zhou, Gim Hee Lee

ICCV 2025 · arXiv:2503.06934 · 3 citations

Matryoshka Multimodal Models

Mu Cai, Jianwei Yang, Jianfeng Gao et al.

ICLR 2025 · arXiv:2405.17430 · 63 citations

MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge

Yuntao Du, Kailin Jiang, Zhi Gao et al.

ICLR 2025 · arXiv:2502.19870 · 10 citations

Reconstructive Visual Instruction Tuning

Haochen Wang, Anlin Zheng, Yucheng Zhao et al.

ICLR 2025 · arXiv:2410.09575 · 35 citations

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025 · arXiv:2503.03321 · 61 citations

SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation

Leigang Qu, Haochuan Li, Wenjie Wang et al.

CVPR 2025 · arXiv:2412.05818 · 10 citations

Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models

Zhaoyi Liu, Huan Zhang

CVPR 2025 · arXiv:2502.18290 · 9 citations

V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer

Hangzhou He, Lei Zhu, Xinliang Zhang et al.

AAAI 2025 · arXiv:2501.04975 · 10 citations

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Sihan Yang, Runsen Xu, Chenhang Cui et al.

ICCV 2025 · arXiv:2508.05211 · 5 citations

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Romy Luo, Zihui (Sherry) Xue, Alex Dimakis et al.

NeurIPS 2025 · arXiv:2510.06077 · 4 citations

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

Talfan Evans, Shreya Pathak, Hamza Merzic et al.

ECCV 2024 · arXiv:2312.05328 · 25 citations

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang et al.

ICML 2024 · arXiv:2401.13311 · 20 citations

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Jin Gao, Lei Gan, Yuankai Li et al.

ECCV 2024 · arXiv:2408.01091 · 4 citations

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li et al.

CVPR 2024 (highlight) · arXiv:2310.03744 · 4359 citations

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.

CVPR 2024 · arXiv:2312.03777 · 89 citations

The Good, The Bad, and Why: Unveiling Emotions in Generative AI

Cheng Li, Jindong Wang, Yixuan Zhang et al.

ICML 2024 · arXiv:2312.11111 · 23 citations