Christoph Feichtenhofer

papers

20,032

total citations

papers (29)

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

CVPR 2022 · arXiv

856

citations

Masked Feature Prediction for Self-Supervised Visual Pre-Training

CVPR 2022 · arXiv

801

citations

Masked Autoencoders As Spatiotemporal Learners

NEURIPS 2022 · arXiv

598

citations

Scaling Language-Image Pre-Training via Masking

CVPR 2023 · arXiv

398

citations

Masked Autoencoders that Listen

NEURIPS 2022 · arXiv

395

citations

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

NEURIPS 2021 · arXiv

342

citations

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

CVPR 2021 · arXiv

288

citations

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

CVPR 2022 · arXiv

247

citations

Demystifying CLIP Data

ICLR 2024 · arXiv

216

citations

Ego-Topo: Environment Affordances From Egocentric Video

CVPR 2020 · arXiv

140

citations

Perception Encoder: The best visual embeddings are not at the output of the network

NEURIPS 2025 · arXiv

129

citations

A Multigrid Method for Efficiently Training Video Models

CVPR 2020 · arXiv

citations

Multiview Compressive Coding for 3D Reconstruction

CVPR 2023 · arXiv

citations

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining

ICCV 2023 · arXiv

citations

MAViL: Masked Audio-Video Learners

NEURIPS 2023 · arXiv

citations

Diffusion Models as Masked Autoencoders

ICCV 2023 · arXiv

citations

Reversible Vision Transformers

CVPR 2022 · arXiv

citations

Multiview Pseudo-Labeling for Semi-Supervised Learning From Video

ICCV 2021 · arXiv

citations

On the Benefits of 3D Pose and Tracking for Human Action Recognition

CVPR 2023 · arXiv

citations

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

NEURIPS 2025 · arXiv

citations

CiT: Curation in Training for Effective Vision-Language Data

ICCV 2023 · arXiv

citations

Window Attention is Bugged: How not to Interpolate Position Embeddings

ICLR 2024 · arXiv

citations

An Empirical Study of Autoregressive Pre-training from Videos

ICCV 2025 · arXiv

citations

papers (29)

A ConvNet for the 2020s

SAM 2: Segment Anything in Images and Videos

Multiscale Vision Transformers

Ego4D: Around the World in 3,000 Hours of Egocentric Video

X3D: Expanding Architectures for Efficient Video Recognition

TrackFormer: Multi-Object Tracking With Transformers

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Masked Autoencoders As Spatiotemporal Learners

Scaling Language-Image Pre-Training via Masking

Masked Autoencoders that Listen

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Demystifying CLIP Data

Ego-Topo: Environment Affordances From Egocentric Video

Perception Encoder: The best visual embeddings are not at the output of the network

A Multigrid Method for Efficiently Training Video Models

Multiview Compressive Coding for 3D Reconstruction

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining

MAViL: Masked Audio-Video Learners

Diffusion Models as Masked Autoencoders

Reversible Vision Transformers

Multiview Pseudo-Labeling for Semi-Supervised Learning From Video

On the Benefits of 3D Pose and Tracking for Human Action Recognition

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

CiT: Curation in Training for Effective Vision-Language Data

Window Attention is Bugged: How not to Interpolate Position Embeddings

An Empirical Study of Autoregressive Pre-training from Videos
