Poster "multimodal understanding" Papers

21 papers found

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

Lequan Lin, Dai Shi, Andi Han et al.

NeurIPS 2025 · arXiv:2511.09833

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio–Language Models

Weifei Jin, Yuxin Cao, Junjie Su et al.

NeurIPS 2025 · arXiv:2510.26096 · 1 citation

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025 · arXiv:2410.18325 · 19 citations

Can LLMs Understand Time Series Anomalies?

Zihao Zhou, Rose Yu

ICLR 2025 · arXiv:2410.05440 · 35 citations

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu et al.

ICCV 2025 · arXiv:2503.21979 · 37 citations

HMVLM: Human Motion-Vision-Language Model via MoE LoRA

Lei Hu, Yongjing Ye, Shihong Xia

NeurIPS 2025

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Huaqiu Li, Yong Wang, Tongwen Huang et al.

ICCV 2025 · arXiv:2507.00790 · 4 citations

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Yuchen Liu, Yaoming Wang, Bowen Shi et al.

ICCV 2025 · arXiv:2507.20842 · 1 citation

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie, Chen Du, Ping Song et al.

ICCV 2025 · arXiv:2411.17762 · 27 citations

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Junhao Xia, Haotian Zhu, Shuchao Pang et al.

NeurIPS 2025

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025 · arXiv:2411.18499 · 20 citations

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai et al.

ICLR 2025 · arXiv:2408.12528 · 484 citations

Teaching Human Behavior Improves Content Understanding Abilities of VLMs

Somesh Singh, Harini S I, Yaman Singla et al.

ICLR 2025 · 2 citations

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Liao Qu, Huichao Zhang, Yiheng Liu et al.

CVPR 2025 · arXiv:2412.03069 · 128 citations

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 · arXiv:2410.07087 · 55 citations

Two Causally Related Needles in a Video Haystack

Miaoyu Li, Qin Chao, Boyang Li

NeurIPS 2025 · arXiv:2505.19853

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Jian Ma, Qirong Peng, Xu Guo et al.

ICCV 2025 · arXiv:2503.06134 · 5 citations

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Zehan Wang, Ziang Zhang, Xize Cheng et al.

ICML 2024 · arXiv:2405.04883 · 19 citations

Generative Multimodal Models are In-Context Learners

Quan Sun, Yufeng Cui, Xiaosong Zhang et al.

CVPR 2024 · arXiv:2312.13286 · 438 citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang et al.

ICML 2024 · arXiv:2404.16006 · 163 citations

NExT-Chat: An LMM for Chat, Detection and Segmentation

Ao Zhang, Yuan Yao, Wei Ji et al.

ICML 2024 · arXiv:2311.04498 · 78 citations