"video question answering" Papers
57 papers found • Page 1 of 2
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Zhucun Xue, Jiangning Zhang, Xurong Xie et al.
ALLVB: All-in-One Long Video Understanding Benchmark
Xichen Tan, Yuanjing Luo, Yunfan Ye et al.
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
Jean Park, Kuk Jin Jang, Basam Alasaly et al.
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
Zeyi Huang, Yuyang Ji, Xiaofang Wang et al.
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo et al.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye et al.
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Sheng Zhou, Junbin Xiao, Qingyun Li et al.
Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh, Erfan Soula, Omid Daliran et al.
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen et al.
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha et al.
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat
How Can Objects Help Video-Language Understanding?
Zitian Tang, Shijie Wang, Junho Cho et al.
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.
ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO
Daechul Ahn, Yura Choi, San Kim et al.
Learning Streaming Video Representation via Multitask Training
Yibin Yan, Jilan Xu, Shangzhe Di et al.
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Joya Chen, Yiqi Lin, Ziyun Zeng et al.
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe et al.
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
Hengzhi Li, Megan Tjandrasuwita, Yi R. (May) Fung et al.
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye, Haotian Zhang, Erik Daxberger et al.
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng et al.
MR. Video: MapReduce as an Effective Principle for Long Video Understanding
Ziqi Pang, Yu-Xiong Wang
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han et al.
Online Video Understanding: OVBench and VideoChat-Online
Zhenpeng Huang, Xinhao Li, Jiaqi Li et al.
OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data
Hao Luo, Zihao Yue, Wanpeng Zhang et al.
OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation
Mohamed Abdelfattah, Kaouther Messaoud, Alexandre Alahi
ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
Philip Schroeder, Ondrej Biza, Thomas Weng et al.
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
Qingni Wang, Tiantian Geng, Zhiyuan Wang et al.
Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi et al.
SEAL: Semantic Attention Learning for Long Video Representation
Lan Wang, Yujia Chen, Wen-Sheng Chu et al.
Seeing the Arrow of Time in Large Multimodal Models
Zihui (Sherry) Xue, Romy Luo, Kristen Grauman
Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han, Linjie Yang, Xiaojun Chang et al.
TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
Ayush Gupta, Anirban Roy, Rama Chellappa et al.
Towards Understanding Camera Motions in Any Video
Zhiqiu Lin, Siyuan Cen, Daniel Jiang et al.
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
Yuxuan Wang, Yiqi Song, Cihang Xie et al.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Jinhui Yi, Syed Talal Wasim, Yanan Luo et al.
VITED: Video Temporal Evidence Distillation
Yujie Lu, Yale Song, Lorenzo Torresani et al.
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou, Alexander Vilesov, Xuehai He et al.
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu, Yue Wu, Meng Chu et al.
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias et al.
AMEGO: Active Memory from long EGOcentric videos
Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta et al.
A Unified Image Compression Method for Human Perception and Multiple Vision Tasks
Sha Guo, Sui Lin, Chen-Lin Zhang et al.
BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind
Yuanyuan Mao, Xin Lin, Qin Ni et al.
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao, Angela Yao, Yicong Li et al.
Koala: Key Frame-Conditioned Long Video-LLM
Reuben Tan, Ximeng Sun, Ping Hu et al.
LingoQA: Video Question Answering for Autonomous Driving
Ana-Maria Marcu, Long Chen, Jan Hünermann et al.
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He et al.
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
Jiaqi Xu, Bo Liu, Yunkuo Chen et al.
Multi-granularity Correspondence Learning from Long-term Noisy Videos
Yijie Lin, Jie Zhang, Zhenyu Huang et al.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li, Yali Wang, Yinan He et al.