VideoMAC: Video Masked Autoencoders Meet ConvNets

21 citations

arXiv:2402.19082

#1218 of 2716 papers in CVPR 2024

Top Authors

Gensheng Pei, Tao Chen, Xiruo Jiang, Huafeng Liu (刘华峰), Zeren Sun, Yazhou Yao

Topics

masked autoencoders, video masked modeling, convolutional neural networks, vision transformers, video object segmentation, human pose tracking, sparse convolutional operators, inter-frame reconstruction consistency

Abstract

Recently, advances in self-supervised learning techniques such as masked autoencoders (MAE) have greatly influenced visual representation learning for images and videos. However, the predominant approaches to masked image / video modeling rely heavily on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent mask pattern dissipation, we use ConvNets implemented with sparse convolutional operators as encoders. We also present a simple yet effective masked video modeling (MVM) approach: a dual encoder architecture, comprising an online encoder and an exponential moving average target encoder, that aims to facilitate inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, by enabling classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% J&F), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
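
To make two ideas from the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It reads "symmetric masking" as applying one shared random patch mask to both frames of a pair, which is an interpretation of the abstract, and it shows the standard exponential moving average rule for a target encoder's weights. The patch size, the 0.75 mask ratio, the 0.9 momentum, and all function names are illustrative assumptions; the abstract specifies none of them. Sparse convolution itself is not shown.

```python
import random


def sample_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch."""
    n_patches = grid_h * grid_w
    n_masked = int(round(n_patches * mask_ratio))
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]


def apply_mask(frame, mask, patch):
    """Zero every pixel that falls inside a masked patch.

    `frame` is a list of rows of floats; `patch` is the patch side length.
    """
    return [[0.0 if mask[y // patch][x // patch] else value
             for x, value in enumerate(row)]
            for y, row in enumerate(frame)]


def ema_update(target, online, momentum):
    """Elementwise: target <- momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target, online)]


# Assumed toy setup: two 8x8 single-channel "frames" and 2x2 patches.
rng = random.Random(0)
H = W = 8
PATCH = 2
frame_a = [[1.0] * W for _ in range(H)]
frame_b = [[2.0] * W for _ in range(H)]

# One mask is sampled and shared by both frames of the pair.
mask = sample_patch_mask(H // PATCH, W // PATCH, mask_ratio=0.75, rng=rng)
masked_a = apply_mask(frame_a, mask, PATCH)
masked_b = apply_mask(frame_b, mask, PATCH)

# Toy "weights": the target encoder slowly follows the online encoder.
online_weights = [1.0, 2.0]
target_weights = [0.0, 0.0]
target_weights = ema_update(target_weights, online_weights, momentum=0.9)
```

In a real pipeline the flat weight lists would be the parameter tensors of the two encoders, and the target encoder would be updated this way after each optimizer step on the online encoder rather than by backpropagation.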

Citation History

Jan 27, 2026 to Feb 13, 2026: 21 (+1) as of Feb 13, 2026