When Open-Vocabulary Visual Question Answering Meets Causal Adapter: Benchmark and Approach
Abstract
Visual Question Answering (VQA) is a multifaceted task that integrates computer vision and natural language processing to produce textual answers from images and questions. Existing VQA benchmarks predominantly adhere to a closed-set paradigm, leaving arbitrary, unseen answers unaddressed and thus falling short in real-world scenarios. To address this limitation, we introduce the Open-Vocabulary Visual Question Answering (OVVQA) benchmark, specifically designed to evaluate models under open-world conditions by assessing their performance on both base classes (seen, common answers) and novel classes (unseen, rare answers). In conjunction with this benchmark, we propose a model-agnostic Causal Adapter that combats the inherent bias in current VQA tasks. Our approach leverages front-door adjustment to strengthen causal reasoning, significantly improving performance on novel categories while maintaining accuracy on base classes. In addition, we introduce an adaptive transfer loss that transfers more knowledge from the pretrained model to the OVVQA task. Extensive experiments across multiple datasets validate the superiority of our method over existing state-of-the-art approaches, demonstrating robust generalization and adaptability in open-world VQA scenarios.
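For reference, the classical front-door adjustment underlying our causal reasoning can be written as follows; this is only a generic sketch of the standard formula, where X, M, and Y are placeholder symbols for the multimodal input, a mediating representation, and the answer, rather than the specific mediator and estimators instantiated by the Causal Adapter (described in the method section):

% Classical front-door adjustment (Pearl); X, M, Y are illustrative placeholders,
% not the paper's own notation.
P(Y \mid do(X=x)) \;=\; \sum_{m} P(M=m \mid X=x) \sum_{x'} P(Y \mid X=x', M=m)\, P(X=x')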