Poster "multimodal large language model" Papers
9 papers found
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
Zhangheng Li, Keen You, Haotian Zhang et al.
ICLR 2025, arXiv:2410.18967
45 citations
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
Yunlong Lin, Zixu Lin, Kunjie Lin et al.
NeurIPS 2025, arXiv:2506.17612
13 citations
Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
Mahmoud Ahmed, Junjie Fei, Jian Ding et al.
ICCV 2025, arXiv:2405.18937
3 citations
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Yicheng Xiao, Lin Song, Yukang Chen et al.
NeurIPS 2025, arXiv:2505.13031
20 citations
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo, Min-Hung Chen, De-An Huang et al.
CVPR 2025, arXiv:2501.08326
9 citations
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
Zaijing Li, Yuquan Xie, Rui Shao et al.
CVPR 2025, arXiv:2502.19902
22 citations
Referring to Any Person
Qing Jiang, Lin Wu, Zhaoyang Zeng et al.
ICCV 2025, arXiv:2503.08507
14 citations
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Shuhuai Ren, Linli Yao, Shicheng Li et al.
CVPR 2024, arXiv:2312.02051
372 citations
UMBRAE: Unified Multimodal Brain Decoding
Weihao Xia, Raoul de Charette, Cengiz Oztireli et al.
ECCV 2024, arXiv:2404.07202
30 citations