Anurag Arnab
29
papers
5,597
total citations
papers (29)
ViViT: A Video Vision Transformer
ICCV 2021
arXiv
2,755
citations
Attention Bottlenecks for Multimodal Fusion
NeurIPS 2021
arXiv
721
citations
Simple Open-Vocabulary Object Detection with Vision Transformers
ECCV 2022
arXiv
372
citations
Multiview Transformers for Video Recognition
CVPR 2022
arXiv
273
citations
On Scaling Up a Multilingual Vision and Language Model
CVPR 2024
arXiv
256
citations
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
CVPR 2024
arXiv
193
citations
End-to-End Generative Pretraining for Multimodal Video Captioning
CVPR 2022
arXiv
187
citations
Dynamic Graph Message Passing Networks
CVPR 2020
arXiv
148
citations
Learning With Neighbor Consistency for Noisy Labels
CVPR 2022
arXiv
99
citations
Streaming Dense Video Captioning
CVPR 2024
arXiv
76
citations
Scenic: A JAX Library for Computer Vision Research and Beyond
CVPR 2022
arXiv
76
citations
UnLoc: A Unified Framework for Video Localization Tasks
ICCV 2023
arXiv
76
citations
Audiovisual Masked Autoencoders
ICCV 2023
arXiv
56
citations
Compressive Visual Representations
NeurIPS 2021
arXiv
53
citations
Unified Graph Structured Models for Video Understanding
ICCV 2021
arXiv
52
citations
VicTR: Video-conditioned Text Representations for Activity Recognition
CVPR 2024
arXiv
38
citations
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
ECCV 2020
arXiv
30
citations
Token Turing Machines
CVPR 2023
arXiv
30
citations
How Can Objects Help Action Recognition?
CVPR 2023
arXiv
27
citations
End-to-End Spatio-Temporal Action Localisation with Video Transformers
CVPR 2024
arXiv
22
citations
Time-, Memory- and Parameter-Efficient Visual Adaptation
CVPR 2024
arXiv
22
citations
Flexible Frame Selection for Efficient Video Reasoning
CVPR 2025
10
citations
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
NeurIPS 2025
arXiv
9
citations
Dense Video Object Captioning from Disjoint Supervision
ICLR 2025
arXiv
7
citations
Does Visual Pretraining Help End-to-End Reasoning?
NeurIPS 2023
arXiv
4
citations
From Image to Video: An Empirical Study of Diffusion Representations
ICCV 2025
arXiv
4
citations
Principles of Visual Tokens for Efficient Video Understanding
ICCV 2025
arXiv
1
citation
Pixel-Aligned Language Model
CVPR 2024
0
citations
TokenLearner: Adaptive Space-Time Tokenization for Videos
NeurIPS 2021
0
citations