"video understanding" Papers
80 papers found • Page 1 of 2
$F^3Set$: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
Zhaoyu Liu, Kan Jiang, Murong Ma et al.
Accident Anticipation via Temporal Occurrence Prediction
Tianhao Zhao, Yiyang Zou, Zihao Mao et al.
Action-Agnostic Point-Level Supervision for Temporal Action Detection
Shuhei M. Yoshida, Takashi Shibata, Makoto Terao et al.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois et al.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du et al.
Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
Eunseop Yoon, Hee Suk Yoon, Mark Hasegawa-Johnson et al.
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
Benlin Liu, Yuhao Dong, Yiqin Wang et al.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.
DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
Xiaoyi Bao, Chen-Wei Xie, Hao Tang et al.
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen, Guoqiang Gong, Tao He et al.
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Juan Chen, Honglin Liu, Yingying Ao et al.
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat
High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse et al.
HIPPO-VIDEO: Simulating Watch Histories with Large Language Models for History-Driven Video Highlighting
Jeongeun Lee, Youngjae Yu, Dongha Lee
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Haoran Li, Yingjie Qin, Baoyuan Ou et al.
Human Motion Instruction Tuning
Lei Li, Sen Jia, Jianhao Wang et al.
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.
Improve Temporal Reasoning in Multimodal Large Language Models via Video Contrastive Decoding
Daiqing Qi, Dongliang Guo, Hanzhang Yuan et al.
Interacted Object Grounding in Spatio-Temporal Human-Object Interactions
Xiaoyang Liu, Boran Wen, Xinpeng Liu et al.
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li et al.
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao, Lujing Xie, Haowei Zhang et al.
Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
Thong Thanh Nguyen, Xiaobao Wu, Yi Bin et al.
Multiple Object Tracking as ID Prediction
Ruopeng Gao, Ji Qi, Limin Wang
Multi-Scale Contrastive Learning for Video Temporal Grounding
Thong Thanh Nguyen, Yi Bin, Xiaobao Wu et al.
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zijia Zhao, Haoyu Lu, Yuqi Huo et al.
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo, Min-Hung Chen, De-An Huang et al.
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun et al.
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.
Prediction-Feedback DETR for Temporal Action Detection
Jihwan Kim, Miso Lee, Cheol-Ho Cho et al.
Progress-Aware Video Frame Captioning
Zihui Xue, Joungbin An, Xitong Yang et al.
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
Yiyang Zhou, Yangfan He, Yaofeng Su et al.
Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction
Quan Zhang, Yuxin Qi, Xi Tang et al.
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie, Tengda Han, Max Bain et al.
Temporal Action Localization with Cross Layer Task Decoupling and Refinement
Qiang Li, Di Liu, Jun Kong et al.
The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning
Xinyang Zhou, Fanyue Wei, Lixin Duan et al.
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang, Ziheng Wang, Boshen Xu et al.
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.
TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Zhu Xu, Ting Lei, Zhimin Li et al.
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Junqi Ge, Ziyi Chen, Jintao Lin et al.
VCA: Video Curious Agent for Long Video Understanding
Zeyuan Yang, Delin Chen, Xueyang Yu et al.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Ziyang Luo, Haoning Wu, Dongxu Li et al.
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li, Xiyang Wu, Guangyao Shi et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang, Yinheng Li, Dan Zhao et al.
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan et al.
Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection
Sung Jin Um, Dongjin Kim, Sangmin Lee et al.
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu et al.
Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
Zhicheng Zheng, Xin Yan, Zhenfang Chen et al.