"visual question answering" Papers

96 papers found • Page 1 of 2

Acknowledging Focus Ambiguity in Visual Questions

Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li et al.

ICCV 2025 · arXiv:2501.02201 · 4 citations

AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation

Jingyi Xie, Jintao Yang, Zhunchen Luo et al.

CVPR 2025

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

Yuejiao Su, Yi Wang, Qiongyang Hu et al.

CVPR 2025 · arXiv:2504.01472 · 5 citations

Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering

Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière et al.

ICCV 2025 · arXiv:2502.04469 · 1 citation

A Token-level Text Image Foundation Model for Document Understanding

Tongkun Guan, Zining Wang, Pei Fu et al.

ICCV 2025 · arXiv:2503.02304 · 4 citations

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu et al.

CVPR 2025 · arXiv:2501.03225 · 23 citations

Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search

Haoran Sun, Yankai Jiang, Wenjie Lou et al.

NEURIPS 2025 · arXiv:2506.16962 · 6 citations

CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering

Tianyu Huai, Jie Zhou, Xingjiao Wu et al.

CVPR 2025 (highlight) · arXiv:2503.00413 · 10 citations

Consistency of Compositional Generalization Across Multiple Levels

Chuanhao Li, Zhen Li, Chenchen Jing et al.

AAAI 2025 · arXiv:2412.13636 · 1 citation

CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology

Yuxuan Sun, Yixuan Si, Chenglu Zhu et al.

CVPR 2025 · arXiv:2412.12077 · 23 citations

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Yuxuan Wang, Yijun Liu, Fei Yu et al.

AAAI 2025 · arXiv:2407.01081 · 7 citations

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee, Geon Choi, Jung-Oh Lee et al.

NEURIPS 2025 (spotlight) · arXiv:2505.18087 · 3 citations

Directional Gradient Projection for Robust Fine-Tuning of Foundation Models

Chengyue Huang, Junjiao Tian, Brisa Maneechotesuwan et al.

ICLR 2025 · arXiv:2502.15895 · 8 citations

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Yue Yang, Shuibo Zhang, Kaipeng Zhang et al.

ICLR 2025 · arXiv:2410.08695 · 17 citations

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Shengyuan Liu, Boyun Zheng, Wenting Chen et al.

NEURIPS 2025 · arXiv:2505.23601 · 10 citations

End-to-End Multi-Modal Diffusion Mamba

Chunhao Lu, Qiang Lu, Meichen Dong et al.

ICCV 2025 · arXiv:2510.13253 · 3 citations

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.

NEURIPS 2025 · arXiv:2506.18322 · 3 citations

Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Amir Mohammad Karimi Mamaghan, Samuele Papa, Karl H. Johansson et al.

ICLR 2025 · arXiv:2407.15589 · 12 citations

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Xiaoqin Wang, Xusen Ma, Xianxu Hou et al.

CVPR 2025 · arXiv:2503.21457 · 8 citations

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video

Aditi Tiwari, Farzaneh Masoud, Dac Nguyen et al.

NEURIPS 2025 (oral)

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra et al.

CVPR 2025 · arXiv:2505.21755 · 4 citations

From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

Mingyang Song, Xiaoye Qu, Jiawei Zhou et al.

CVPR 2025 · arXiv:2503.12821 · 3 citations

Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

Simon Park, Abhishek Panigrahi, Yun Cheng et al.

ICML 2025 · arXiv:2501.02669 · 9 citations

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Jiahui Gao, Renjie Pi, Jipeng Zhang et al.

ICLR 2025 · arXiv:2312.11370 · 174 citations

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts, Kai Han, Samuel Albanie

ICCV 2025 · arXiv:2408.11817 · 3 citations

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Qirui Chen, Shangzhe Di, Weidi Xie

AAAI 2025 · arXiv:2408.14469 · 27 citations

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Eunkyu Park, Minyeong Kim, Gunhee Kim

CVPR 2025 · arXiv:2506.10286 · 3 citations

Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin, Chao Chen, Zhihang Fu et al.

CVPR 2025 · arXiv:2506.11036 · 9 citations

Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

Minh-Quan Le, Gaurav Mittal, Tianjian Meng et al.

ICLR 2025 · arXiv:2502.05153 · 4 citations

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Xin Dong, Shichao Dong, Jin Wang et al.

ICCV 2025 · arXiv:2507.05056 · 3 citations

Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models

Guosheng Zhang, Keyao Wang, Haixiao Yue et al.

AAAI 2025 · arXiv:2501.01720 · 6 citations

Language-Bias-Resilient Visual Question Answering via Adaptive Multi-Margin Collaborative Debiasing

Huanjia Zhu, Shuyuan Zheng, Yishu Liu et al.

NEURIPS 2025

Language-Image Models with 3D Understanding

Jang Hyun Cho, Boris Ivanovic, Yulong Cao et al.

ICLR 2025 · arXiv:2405.03685 · 27 citations

LiveXiv - A Multi-Modal live benchmark based on Arxiv papers content

Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh et al.

ICLR 2025 · arXiv:2410.10783 · 12 citations

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025 · arXiv:2411.10440 · 360 citations

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez et al.

CVPR 2025 · arXiv:2503.13399 · 15 citations

Mimic In-Context Learning for Multimodal Tasks

Yuchu Jiang, Jiale Fu, Chenduo Hao et al.

CVPR 2025 · arXiv:2504.08851 · 9 citations

Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies Between Model Predictions and Human Responses in VQA

Jian Lan, Diego Frassinelli, Barbara Plank

AAAI 2025 · arXiv:2410.02773 · 3 citations

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara et al.

ICLR 2025 · arXiv:2502.17422 · 88 citations

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying, Ruiping Liu, Chongyan Chen et al.

NEURIPS 2025 · arXiv:2510.11520

Multi-step Visual Reasoning with Visual Tokens Scaling and Verification

Tianyi Bai, Zengjie Hu, Fupeng Sun et al.

NEURIPS 2025 · arXiv:2506.07235 · 14 citations

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Quang-Hung Le, Long Hoang Dang, Ngan Hoang Le et al.

AAAI 2025 · arXiv:2412.08125 · 3 citations

ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

Duong T. Tran, Trung-Kien Tran, Manfred Hauswirth et al.

ICCV 2025 · arXiv:2507.16403 · 3 citations

Scaling Language-Free Visual Representation Learning

David Fan, Shengbang Tong, Jiachen Zhu et al.

ICCV 2025 (highlight) · arXiv:2504.01017 · 41 citations

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou, Yiheng Wang, Xuming He et al.

NEURIPS 2025 · arXiv:2506.10521 · 18 citations

Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Feilong Tang, Chengzhi Liu, Zhongxing Xu et al.

CVPR 2025 · arXiv:2505.16652 · 25 citations

Seeking and Updating with Live Visual Knowledge

Mingyang Fu, Yuyang Peng, Dongping Chen et al.

NEURIPS 2025 · arXiv:2504.05288 · 7 citations

SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Xin Su, Man Luo, Kris Pan et al.

ICML 2025 (oral) · arXiv:2406.19593 · 6 citations

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari et al.

CVPR 2025 (highlight) · arXiv:2504.02823 · 2 citations

TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models

Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.

NEURIPS 2025