"multimodal instruction following" Papers
3 papers found
LaViDa: A Large Diffusion Model for Vision-Language Understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal et al.
NeurIPS 2025, spotlight
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang, Xuesong Niu, Nan Jiang et al.
ECCV 2024, arXiv:2407.12435
23 citations
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
Jiasen Lu, Christopher Clark, Sangho Lee et al.
CVPR 2024, highlight, arXiv:2312.17172
280 citations