Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

AAAI 2025

Abstract

Video question answering plays a vital role in computer vision, and recent advances in large language models have further propelled the field. However, existing video question answering techniques often struggle to grasp fine-grained video content in the spatial dimension. This limitation mainly stems from the fixed, low-resolution input of video frames. While some approaches partially alleviate the problem by using high-resolution inputs, they incur excessive computational cost by encoding the entire high-resolution image. In this work, we propose a granularity-adaptive spatial evidence tokenization model for video question answering. Our method introduces multi-granular visual tokenization in the spatial dimension to produce video tokens at various granularities based on the question. A granularity weighting module highlights spatially activated patches at low resolution, and these activated patches are then adaptively encoded at high resolution to supply detail. To mitigate the computational overhead of high-resolution frame encoding, we develop a masking and acceleration module for efficient visual tokenization. Moreover, a granularity compression module dynamically selects and compresses visual tokens of varying granularities based on the question. We conduct extensive experiments on 11 mainstream video question answering datasets, and the results demonstrate the effectiveness of the proposed method.
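
To make the idea of question-guided patch activation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each low-resolution patch and the question are already embedded as vectors, and it uses cosine similarity with top-k selection as a stand-in for the granularity weighting module, whose actual scoring rule the abstract does not specify. The function names, the grid layout, and the `scale` factor between low and high resolution are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_activated_patches(patch_feats, question_feat, k):
    """Return the indices of the k patches most similar to the question.

    Assumption: similarity to the question stands in for the paper's
    granularity weighting; the real module may score patches differently.
    """
    scores = [cosine(p, question_feat) for p in patch_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def patch_to_highres_box(idx, grid_w, patch_px, scale):
    """Map a low-res patch index to its (x0, y0, x1, y1) box in the high-res frame.

    Patches are assumed to be laid out row-major on a grid of width grid_w.
    """
    row, col = divmod(idx, grid_w)
    size = patch_px * scale
    return (col * size, row * size, (col + 1) * size, (row + 1) * size)


# Toy example: a 2x2 grid of low-res patches with 2-d features.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
question_feat = [1.0, 0.2]

selected = select_activated_patches(patch_feats, question_feat, k=2)
boxes = [patch_to_highres_box(i, grid_w=2, patch_px=14, scale=2) for i in selected]
print(selected, boxes)
```

Only the regions in `boxes` would then be cropped from the high-resolution frame and encoded, which is the source of the savings relative to encoding the whole high-resolution image.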
