Visual Attention Network

Abstract

While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks by a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.
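
The core idea of LKA is to decompose a large-kernel convolution into a depth-wise convolution, a depth-wise dilated convolution, and a pointwise convolution, and to use the result as an attention map that gates the input. Below is a minimal PyTorch sketch of this idea; the kernel sizes (5x5 depth-wise, 7x7 depth-wise with dilation 3, 1x1 pointwise, approximating a 21x21 receptive field) follow the decomposition described in the VAN paper, while the class and layer names are illustrative and not necessarily the exact implementation in this repo.

```python
import torch
import torch.nn as nn


class LKA(nn.Module):
    """Sketch of Large Kernel Attention: a large-kernel conv decomposed into
    depth-wise conv + depth-wise dilated conv + pointwise conv, whose output
    gates the input feature map."""

    def __init__(self, dim):
        super().__init__()
        # 5x5 depth-wise convolution (local spatial context)
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # 7x7 depth-wise convolution with dilation 3 (long-range context,
        # ~21x21 effective receptive field)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        # 1x1 convolution (channel mixing -> channel adaptability)
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.dw_conv(x)
        attn = self.dw_dilated(attn)
        attn = self.pw_conv(attn)
        # Element-wise gating: linear in the number of pixels, unlike the
        # quadratic cost of standard self-attention.
        return x * attn


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(LKA(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```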

Results and models

ImageNet-1k

| Model    | Pretrain     | Resolution | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download    |
| :------- | :----------- | :--------- | :--------- | :-------- | :-------- | :-------- | :----- | :---------- |
| VAN-T    | From scratch | 224x224    | 4.11       | 0.88      | 75.77     | 92.99     | config | model / log |
| VAN-T\*  | From scratch | 224x224    | 4.11       | 0.88      | 75.41     | 93.02     | config | model       |
| VAN-S    | From scratch | 224x224    | 13.86      | 2.52      | 81.03     | 95.56     | config | model / log |
| VAN-S\*  | From scratch | 224x224    | 13.86      | 2.52      | 81.01     | 95.63     | config | model       |
| VAN-B    | From scratch | 224x224    | 26.58      | 5.03      | 82.65     | 96.17     | config | model / log |
| VAN-B\*  | From scratch | 224x224    | 26.58      | 5.03      | 82.80     | 96.21     | config | model       |
| VAN-L\*  | From scratch | 224x224    | 44.77      | 8.99      | 83.86     | 96.73     | config | model       |
| VAN-B4\* | From scratch | 224x224    | 60.28      | 12.22     | 84.13     | 96.86     | config | model       |

In the latest version of VAN on arXiv, some model names were changed: VAN-b0, VAN-b1, VAN-b2, and VAN-b3 are identical to VAN-T, VAN-S, VAN-B, and VAN-L, respectively. We follow the original training settings provided by the official repo and the original paper. Models marked with \* are converted from the official repo. We also reproduce the performance of VAN-T and VAN-S trained for 300 epochs, as shown by the converted checkpoints that can be loaded with the sketch below.
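
For reference, a checkpoint from the table above can be loaded with MMClassification's high-level inference API; a minimal sketch follows, in which the config path, checkpoint path, and image path are placeholders to be replaced with the actual `config` and `model` links from the table.

```python
# Minimal sketch using MMClassification's inference API.
# The config, checkpoint, and image paths below are placeholders; substitute
# the actual files linked in the table above.
from mmcls.apis import inference_model, init_model

config_file = 'configs/van/van-tiny_config.py'          # placeholder path
checkpoint_file = 'checkpoints/van-tiny_converted.pth'  # placeholder path

model = init_model(config_file, checkpoint_file, device='cpu')
result = inference_model(model, 'demo/demo.JPEG')       # placeholder image
print(result)
```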

Pre-trained Models

| Model    | Pretrain     | Resolution | Params (M) | Flops (G) | Download |
| :------- | :----------- | :--------- | :--------- | :-------- | :------- |
| VAN-B4\* | ImageNet-21k | 224x224    | 60.28      | 12.22     | model    |
| VAN-B5\* | ImageNet-21k | 224x224    | 89.97      | 17.21     | model    |
| VAN-B6\* | ImageNet-21k | 224x224    | 283.9      | 55.28     | model    |

The models pre-trained on ImageNet-21k are used for fine-tuning on downstream tasks. Models marked with \* are converted from the official repo.

Citation

@article{guo2023van,
  title={Visual Attention Network},
  author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
  journal={Computational Visual Media (CVMJ)},
  pages={733--752},
  year={2023}
}