"video captioning" Papers

17 papers found

ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Ruchit Rawal, Reza Shirkavand, Heng Huang et al.

ICCV 2025 · arXiv:2506.07371
4 citations

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.

ICLR 2025 (oral) · arXiv:2408.06072
1409 citations

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

CVPR 2025 · arXiv:2503.08585
13 citations

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.

CVPR 2025 · arXiv:2411.18042
9 citations

Modeling dynamic social vision highlights gaps between deep learning and humans

Kathy Garcia, Emalie McMahon, Colin Conwell et al.

ICLR 2025

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NeurIPS 2025 (oral) · arXiv:2504.13180
47 citations

Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang et al.

CVPR 2025 · arXiv:2412.02071
7 citations

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Mingfei Han, Linjie Yang, Xiaojun Chang et al.

ICLR 2025 · arXiv:2312.10300
48 citations

VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning

Ji Soo Lee, Jongha Kim, Jeehye Na et al.

AAAI 2025 · arXiv:2501.06761
9 citations

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Wenhao Wang, Yi Yang

NeurIPS 2025 · arXiv:2503.01739
11 citations

Beyond MOT: Semantic Multi-Object Tracking

Yunhao Li, Qin Li, Hao Wang et al.

ECCV 2024 · arXiv:2403.05021
16 citations

COM Kitchens: An Unedited Overhead-view Procedural Video Dataset as a Vision-Language Benchmark

Atsushi Hashimoto, Koki Maeda, Tosho Hirasawa et al.

ECCV 2024

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Nina Shvetsova, Anna Kukleva, Xudong Hong et al.

ECCV 2024 · arXiv:2310.04900
33 citations

Learning Video Context as Interleaved Multimodal Sequences

Qinghong Lin, Pengchuan Zhang, Difei Gao et al.

ECCV 2024 · arXiv:2407.21757
12 citations

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2024 · arXiv:2402.19479
351 citations

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Haoyu Lu, Yuqi Huo, Guoxing Yang et al.

ICLR 2024 · arXiv:2302.06605
55 citations

Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.

CVPR 2024 · arXiv:2402.13250
85 citations