VCA: Video Curious Agent for Long Video Understanding

31 citations

arXiv:2412.10471


#92 of 2701 papers in ICCV 2025

Top Authors


Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan

Topics

video understanding, long video analysis, curiosity-driven agents, tree-search exploration, vision-language models, intrinsic reward mechanisms, temporal complexity

Abstract

Long video understanding poses unique challenges because long videos are temporally complex and have low information density. Recent works address this task by sampling numerous frames or by incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages the VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach's superior effectiveness and efficiency.
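
The tree-search idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `explore_video`, the parameters `branching` and `frame_budget`, the choice of best-first search, and the `score_segment` callback (a hypothetical stand-in for the VLM's self-generated intrinsic reward) are all assumptions made for illustration.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open range [start, end) of frame indices


def explore_video(
    num_frames: int,
    score_segment: Callable[[Segment], float],
    branching: int = 4,
    frame_budget: int = 8,
) -> List[int]:
    """Best-first tree search over video segments (illustrative only).

    `score_segment` is a hypothetical placeholder for an intrinsic reward;
    a higher score means the segment looks more worth exploring.
    """
    collected: List[int] = []
    # heapq is a min-heap, so scores are negated to pop the best segment first.
    frontier: List[Tuple[float, Segment]] = [(0.0, (0, num_frames))]
    while frontier and len(collected) < frame_budget:
        _, (start, end) = heapq.heappop(frontier)
        # Collect the middle frame of the segment being expanded.
        mid = (start + end) // 2
        if mid not in collected:
            collected.append(mid)
        length = end - start
        if length < branching:
            continue  # too short to split further
        # Split the segment into `branching` children and score each one.
        bounds = [start + i * length // branching for i in range(branching + 1)]
        for lo, hi in zip(bounds, bounds[1:]):
            heapq.heappush(frontier, (-score_segment((lo, hi)), (lo, hi)))
    return sorted(collected)


# Toy reward (an assumption): only segments containing frame 700 are rewarded.
def toy_reward(segment: Segment) -> float:
    lo, hi = segment
    return 1.0 if lo <= 700 < hi else 0.0


frames = explore_video(1000, toy_reward, branching=4, frame_budget=5)
```

With the toy reward, the search repeatedly descends into the child segment that contains the rewarded frame, so the collected frames cluster around it instead of being spread uniformly across the video.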
