"vision-language fusion" Papers
3 papers found
Conference
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Qiong Wu, Wenhao Lin, Yiyi Zhou et al.
NeurIPS 2025 · arXiv:2411.19628 · 5 citations
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren et al.
ECCV 2024 · arXiv:2303.05499 · 3442 citations
Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Liqi He, Zuchao Li, Xiantao Cai et al.
AAAI 2024 · arXiv:2312.08762 · 37 citations