Arsha Nagrani
26 papers, 4,082 total citations
papers (26)
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (ICCV 2021, arXiv), 1,472 citations
Attention Bottlenecks for Multimodal Fusion (NeurIPS 2021, arXiv), 721 citations
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (CVPR 2023, arXiv), 332 citations
On Scaling Up a Multilingual Vision and Language Model (CVPR 2024, arXiv), 256 citations
Localizing Visual Sounds the Hard Way (CVPR 2021, arXiv), 227 citations
End-to-End Generative Pretraining for Multimodal Video Captioning (CVPR 2022, arXiv), 187 citations
Learning Audio-Video Modalities from Image Captions (ECCV 2022, arXiv), 96 citations
Verbs in Action: Improving Verb Understanding in Video-Language Models (ICCV 2023, arXiv), 89 citations
UnLoc: A Unified Framework for Video Localization Tasks (ICCV 2023, arXiv), 76 citations
Streaming Dense Video Captioning (CVPR 2024, arXiv), 76 citations
Look Before You Speak: Visually Contextualized Utterances (CVPR 2021, arXiv), 71 citations
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering (CVPR 2024, arXiv), 67 citations
Speech2Action: Cross-Modal Supervision for Action Recognition (CVPR 2020, arXiv), 60 citations
AutoAD: Movie Description in Context (CVPR 2023, arXiv), 50 citations
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description (ICCV 2023, arXiv), 49 citations
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency (ECCV 2022, arXiv), 42 citations
VidChapters-7M: Video Chapters at Scale (NeurIPS 2023, arXiv), 41 citations
VicTR: Video-conditioned Text Representations for Activity Recognition (CVPR 2024, arXiv), 38 citations
AutoAD III: The Prequel – Back to the Pixels (CVPR 2024, arXiv), 34 citations
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos (ECCV 2020, arXiv), 30 citations
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR (CVPR 2023, arXiv), 25 citations
Composable Augmentation Encoding for Video Representation Learning (ICCV 2021, arXiv), 19 citations
MINERVA: Evaluating Complex Video Reasoning (ICCV 2025, arXiv), 10 citations
Flexible Frame Selection for Efficient Video Reasoning (CVPR 2025), 10 citations
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation (ICCV 2025, arXiv), 3 citations
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks (CVPR 2025, arXiv), 1 citation