FreeNet: Liberating Depth-Wise Separable Operations for Building Faster Mobile Vision Architectures
Abstract
In the pursuit of efficient vision architectures, substantial effort has been devoted to optimizing operator efficiency. Depth-wise separable operators, such as DWConv, are cheap in both FLOPs and parameters. As a result, they are increasingly incorporated into efficient backbones, where the savings are spent on deeper and wider architectures to improve performance. However, separable operators are not actually fast on devices, because they require discontinuous memory access. In this paper, we propose FreeNet, a family of simple and efficient backbones that frees the architecture from separable operations to further improve running speed. We introduce the sparse sampling mixer (S2-Mixer) to replace existing separable token mixers. The S2-Mixer samples multiple segments of partially continuous signals across the spatial and channel dimensions for convolutional processing, achieving extremely fast on-device speed. The sparse sampling also enables the S2-Mixer to capture long-range pixel relationships from dynamic receptive fields. Furthermore, we introduce a Shift Feed-Forward Network (ShiftFFN) as a faster alternative to existing channel mixers. It uses a shift neck architecture that aggregates global information to shift features, enabling faster channel mixing while incorporating global pixel information. Extensive experiments demonstrate that FreeNet offers a superior accuracy-efficiency tradeoff compared to the latest efficient models. On ImageNet-1k, FreeNet-S2 outperforms StarNet-S4 by 0.4% in top-1 accuracy, while running around 40% faster on a desktop GPU and 15% faster on a mobile GPU.
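To make the claim about FLOPs and parameters concrete, the following minimal sketch compares a standard convolution with a depth-wise separable one (a depth-wise k x k convolution followed by a 1 x 1 point-wise convolution). The function names, layer sizes, and the assumptions of stride 1, "same" padding, and no bias are illustrative choices, not taken from the paper. The sketch counts arithmetic only; it says nothing about memory access cost, which is the bottleneck the abstract points to, and it does not model the S2-Mixer or ShiftFFN.

```python
def standard_conv_cost(c_in, c_out, k, h, w):
    """Parameters and multiply-accumulates (MACs) of a standard k x k convolution.

    Assumes stride 1, 'same' padding, and no bias (illustrative assumptions).
    """
    params = c_in * c_out * k * k
    macs = params * h * w  # every weight is used once per output position
    return params, macs


def depthwise_separable_cost(c_in, c_out, k, h, w):
    """Parameters and MACs of a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise_params = c_in * k * k   # one k x k filter per input channel
    pointwise_params = c_in * c_out   # 1 x 1 conv that mixes channels
    params = depthwise_params + pointwise_params
    macs = params * h * w
    return params, macs


# Hypothetical layer: 64 -> 64 channels, 3 x 3 kernel, 56 x 56 feature map.
std_params, std_macs = standard_conv_cost(64, 64, 3, 56, 56)
sep_params, sep_macs = depthwise_separable_cost(64, 64, 3, 56, 56)
ratio = sep_params / std_params  # equals 1/k**2 + 1/c_out

print(std_params, sep_params)  # 36864 4672
print(round(ratio, 4))         # 0.1267
```

For this hypothetical layer the separable form needs roughly one eighth of the parameters and MACs, which is why such operators look attractive on paper even when their measured on-device latency does not shrink proportionally.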