A Close Look at Spatial Modeling: From Attention to Convolution. (arXiv:2212.12552v1 [cs.CV])

Vision Transformers have shown great promise recently for many vision tasks
due to the insightful architecture design and attention mechanism. By
revisiting the self-attention responses in Transformers, we empirically observe
two interesting issues. First, Vision Transformers present a query-irrelevant
behavior at deep layers, where the attention maps exhibit nearly consistent
contexts in global scope, regardless of the query patch position (also
head-irrelevant). Second, the attention maps are intrinsically sparse: a few
tokens dominate the attention weights; introducing knowledge from ConvNets
would largely smooth the attention and enhance performance. Motivated by
the above observations, we generalize the self-attention formulation to abstract a
query-irrelevant global context directly and further integrate the global
context into convolutions. The resulting model, a Fully Convolutional Vision
Transformer (i.e., FCViT), purely consists of convolutional layers and firmly
inherits the merits of both attention mechanism and convolutions, including
dynamic property, weight sharing, and short- and long-range feature modeling,
etc. Experimental results demonstrate the effectiveness of FCViT. With less
than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7%
in top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still
perform better than previous state-of-the-art ConvNeXt with even fewer
parameters. FCViT-based models also demonstrate promising transferability to
downstream tasks, like object detection, instance segmentation, and semantic
segmentation. Code and models are made available at:

Source: https://arxiv.org/abs/2212.12552
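
The central idea in the abstract, a query-irrelevant global context that is folded into convolutions, can be illustrated with a small NumPy sketch. This is not the FCViT implementation: the function names, the single-vector scoring projection `w_score`, the value projection `w_value`, and the choice of a 3x3 depthwise convolution are all assumptions made for illustration. The sketch only shows that when attention weights do not depend on the query position, the attention output collapses to one context vector shared by every position, which can then be added to a local convolutional response.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_context(tokens, w_score, w_value):
    """tokens: (N, C); w_score: (C,); w_value: (C, C). Hypothetical parameters."""
    # One score per token, with no dependence on any query position.
    weights = softmax(tokens @ w_score, axis=0)   # (N,)
    # Weighted sum of projected tokens gives a single context vector.
    return weights @ (tokens @ w_value)           # (C,)

def depthwise_conv3x3(x, kernels):
    """x: (H, W, C); kernels: (3, 3, C). Zero padding, stride 1."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx]
    return out

def conv_with_global_context(x, kernels, w_score, w_value):
    h, w, c = x.shape
    ctx = global_context(x.reshape(h * w, c), w_score, w_value)
    # Broadcast the shared context to every position and add the local response.
    return depthwise_conv3x3(x, kernels) + ctx

# Assumed toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
x = rng.standard_normal((H, W, C))
kernels = rng.standard_normal((3, 3, C)) * 0.1
w_score = rng.standard_normal(C) * 0.1
w_value = rng.standard_normal((C, C)) * 0.1
y = conv_with_global_context(x, kernels, w_score, w_value)
```

In this sketch the global term costs O(N) in the number of tokens rather than the O(N^2) of full self-attention, because the weights are computed once instead of once per query.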

