"visual question answering" Papers

96 papers found • Page 2 of 2

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

Jeremy Irvin, Emily Liu, Joyce Chen et al.

ICLR 2025 (oral) • arXiv:2410.06234
45 citations

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Shaofeng Yin, Ting Lei, Yang Liu

ICCV 2025 • arXiv:2508.03284
4 citations

Two Causally Related Needles in a Video Haystack

Miaoyu Li, Qin Chao, Boyang Li

NeurIPS 2025 • arXiv:2505.19853

Understanding Museum Exhibits using Vision-Language Reasoning

Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca et al.

ICCV 2025 • arXiv:2412.01370
1 citation

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Qihui Zhang, Munan Ning, Zheyuan Liu et al.

CVPR 2025 • arXiv:2503.14941
2 citations

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Chao Pang, Xingxing Weng, Jiang Wu et al.

AAAI 2025 • arXiv:2403.20213
54 citations

Visual Agents as Fast and Slow Thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang et al.

ICLR 2025 • arXiv:2408.08862
48 citations

ViUniT: Visual Unit Tests for More Robust Visual Programming

Artemis Panagopoulou, Honglu Zhou, Silvio Savarese et al.

CVPR 2025 • arXiv:2412.08859
2 citations

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales

ICLR 2025 • arXiv:2403.13164
18 citations

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao et al.

NeurIPS 2025 • arXiv:2511.22154

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang et al.

ICCV 2025 • arXiv:2503.07588
14 citations

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu, Yifan Xu, Yi Li et al.

AAAI 2024 • arXiv:2308.09936
192 citations

BOK-VQA: Bilingual Outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining

Minjun Kim, SeungWoo Song, Youhan Lee et al.

AAAI 2024 • arXiv:2401.06443
10 citations

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

Cheng Tan, Jingxuan Wei, Zhangyang Gao et al.

ECCV 2024 • arXiv:2311.14109
29 citations

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

Chuanhao Li, Zhen Li, Chenchen Jing et al.

ECCV 2024
1 citation

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Zaid Khan, Yun Fu

CVPR 2024 • arXiv:2404.10193
26 citations

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Dachuan Shi, Chaofan Tao, Anyi Rao et al.

ICML 2024 • arXiv:2305.17455
43 citations

Detecting and Preventing Hallucinations in Large Vision Language Models

Anisha Gunjal, Jihan Yin, Erhan Bas

AAAI 2024 • arXiv:2308.06394
264 citations

Detection-Based Intermediate Supervision for Visual Question Answering

Yuhang Liu, Daowan Peng, Wei Wei et al.

AAAI 2024 • arXiv:2312.16012
3 citations

Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following

Qiaomu Miao, Alexandros Graikos, Jingwei Zhang et al.

ECCV 2024 • arXiv:2406.02774
1 citation

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Junyi Chen, Longteng Guo, Jia Sun et al.

AAAI 2024 • arXiv:2308.11971
20 citations

Extracting Training Data From Document-Based VQA Models

Francesco Pinto, Nathalie Rauschmayr, Florian Tramer et al.

ICML 2024 • arXiv:2407.08707
6 citations

GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering

Yifeng Zhang, Ming Jiang, Qi Zhao

ECCV 2024
2 citations

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024 • arXiv:2310.14566
392 citations

Image Content Generation with Causal Reasoning

Xiaochuan Li, Baoyu Fan, Run Zhang et al.

AAAI 2024 • arXiv:2312.07132
12 citations

Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning

Wei Li, Hehe Fan, Yongkang Wong et al.

ICML 2024

Interactive Visual Task Learning for Robots

Weiwei Gu, Anant Sah, N. Gopalan

AAAI 2024 • arXiv:2312.13219
7 citations

Language-Informed Visual Concept Learning

Sharon Lee, Yunzhi Zhang, Shangzhe Wu et al.

ICLR 2024 • arXiv:2312.03587
12 citations

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang et al.

ECCV 2024 • arXiv:2403.14624
498 citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang et al.

ICML 2024 • arXiv:2404.16006
163 citations

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Didi Zhu, Zhongyi Sun, Zexi Li et al.

ICML 2024 • arXiv:2402.12048
48 citations

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

CVPR 2024 (highlight) • arXiv:2311.06607
392 citations

NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving

Tianwen Qian, Jingjing Chen, Linhai Zhuo et al.

AAAI 2024 • arXiv:2305.14836
271 citations

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.

CVPR 2024 • arXiv:2312.03777
89 citations

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Soroush Nasiriany, Fei Xia, Wenhao Yu et al.

ICML 2024 • arXiv:2402.07872
188 citations

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna et al.

ICML 2024 • arXiv:2402.07865
258 citations

Recursive Visual Programming

Jiaxin Ge, Sanjay Subramanian, Baifeng Shi et al.

ECCV 2024 • arXiv:2312.02249
10 citations

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma, Furong Xu, Jian Liu et al.

ICML 2024 • arXiv:2401.02137
7 citations

Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Mingyu Zhang, Jiting Cai, Mingyu Liu et al.

ECCV 2024 • arXiv:2407.19666
9 citations

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

Chengen Lai, Shengli Song, Shiqi Meng et al.

AAAI 2024 • arXiv:2312.13594
10 citations

TrojVLM: Backdoor Attack Against Vision Language Models

Weimin Lyu, Lu Pang, Tengfei Ma et al.

ECCV 2024 • arXiv:2409.19232
25 citations

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Haoyu Lu, Yuqi Huo, Guoxing Yang et al.

ICLR 2024 • arXiv:2302.06605
55 citations

View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

ECCV 2024 • arXiv:2404.07984
31 citations

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge, Xiaohui Zeng, Jacob Huffman et al.

CVPR 2024 • arXiv:2404.19752
35 citations

VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Yibo Liu, Zheyuan Yang, Guile Wu et al.

ECCV 2024 • arXiv:2407.06516
10 citations

WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

Pingyi Chen, Chenglu Zhu, Sunyi Zheng et al.

ECCV 2024 • arXiv:2407.05603
25 citations