"vision-language tasks" Papers
28 papers found
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Tatiana Zemskova, Dmitry Yudin
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Yunze Man, De-An Huang, Guilin Liu et al.
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
Shengqiong Wu, Hao Fei, Liangming Pan et al.
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Hyojin Bahng, Caroline Chan, Fredo Durand et al.
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
Yongxin Guo, Zhenglin Cheng, Xiaoying Tang et al.
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Gen Luo, Yiyi Zhou, Yuxin Zhang et al.
Gatekeeper: Improving Model Cascades Through Confidence Tuning
Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha et al.
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
Tianyi Bai, Yuxuan Fan, Jiantao Qiu et al.
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
Lorenzo Basile, Valentino Maiorca, Diego Doimo et al.
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua et al.
Mitigate the Gap: Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami, Gerard de Melo
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni, Zhengyuan Yang, Linjie Li et al.
Release the Powers of Prompt Tuning: Cross-Modality Prompt Transfer
Ningyuan Zhang, Jie Lu, Keqiuyin Li et al.
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai et al.
Text4Seg: Reimagining Image Segmentation as Text Generation
Mengcheng Lan, Chaofeng Chen, Yue Zhou et al.
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu, Hao Fei, Xiangtai Li et al.
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang, Chen-Wei Xie, Haiyang Wang et al.
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang, Jiajun Deng, Mingbo Jia
DiaLoc: An Iterative Approach to Embodied Dialog Localization
Chao Zhang, Mohan Li, Ignas Budvytis et al.
Differentially Private Representation Learning via Image Captioning
Tom Sander, Yaodong Yu, Maziar Sanjabi et al.
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Haokun Chen, Yao Zhang, Denis Krompass et al.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia, Dongchen Han, Yizeng Han et al.
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su, Peihan Miao, Huanzhang Dou et al.
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Dongyang Liu, Renrui Zhang, Longtian Qiu et al.
VIGC: Visual Instruction Generation and Correction
Bin Wang, Fan Wu, Xiao Han et al.