"multimodal models" Papers
23 papers found
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Mohsen Gholami, Mohammad Akbari, Kevin Cannons et al.
Context-aware Dynamic Pruning for Speech Foundation Models
Masao Someki, Yifan Peng, Siddhant Arora et al.
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)
Subba Reddy Oota, Akshett Rai Jindal, Ishani Mondal et al.
Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision
Weicai Yan, Wang Lin, Zirun Guo et al.
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.
DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection
Li Li, Huixian Gong, Hao Dong et al.
ElasticTok: Adaptive Tokenization for Image and Video
Wilson Yan, Volodymyr Mnih, Aleksandra Faust et al.
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou, Gim Hee Lee
Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao et al.
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Yuntao Du, Kailin Jiang, Zhi Gao et al.
Reconstructive Visual Instruction Tuning
Haochen Wang, Anlin Zheng, Yucheng Zhao et al.
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
Leigang Qu, Haochuan Li, Wenjie Wang et al.
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Zhaoyi Liu, Huan Zhang
V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer
Hangzhou He, Lei Zhu, Xinliang Zhang et al.
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui et al.
When Thinking Drifts: Evidential Grounding for Robust Video Reasoning
Romy Luo, Zihui (Sherry) Xue, Alex Dimakis et al.
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Talfan Evans, Shreya Pathak, Hamza Merzic et al.
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang et al.
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
Jin Gao, Lei Gan, Yuankai Li et al.
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li et al.
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.
The Good, The Bad, and Why: Unveiling Emotions in Generative AI
Cheng Li, Jindong Wang, Yixuan Zhang et al.