Karttikeya Mangalam

papers

6,719

total citations

papers (21)

Multiscale Vision Transformers

ICCV 2021, arXiv

1,529

citations

Ego4D: Around the World in 3,000 Hours of Egocentric Video

CVPR 2022, arXiv

1,511

citations

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

CVPR 2022, arXiv

856

citations

It is not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction

ECCV 2020, arXiv

543

citations

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

NeurIPS 2023, arXiv

515

citations

From Goals, Waypoints & Paths to Long Term Human Trajectory Forecasting

ICCV 2021, arXiv

327

citations

Long-term Human Motion Prediction with Scene Context

ECCV 2020, arXiv

279

citations

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

CVPR 2022, arXiv

247

citations

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

NeurIPS 2022, arXiv

citations

Latency Matters: Real-Time Action Forecasting Transformer

CVPR 2023

citations

Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

CVPR 2023

citations

Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

CVPR 2024

citations

papers (21)

Multiscale Vision Transformers

Ego4D: Around the World in 3,000 Hours of Egocentric Video

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

It is not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

From Goals, Waypoints & Paths to Long Term Human Trajectory Forecasting

Long-term Human Motion Prediction with Scene Context

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Sequential Modeling Enables Scalable Learning for Large Vision Models

Speculative Decoding with Big Little Decoder

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Object-Region Video Transformers

LOKI: Long Term and Key Intentions for Trajectory Prediction

Diffusion Models as Masked Autoencoders

Reversible Vision Transformers

Do Vision and Language Encoders Represent the World Similarly?

xT: Nested Tokenization for Larger Context in Large Images

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

Latency Matters: Real-Time Action Forecasting Transformer

Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
