PaddlePaddle-Swin-Transformer-V2

Unofficial Paddle implementation of "Swin Transformer V2: Scaling Up Capacity and Resolution"


Swin Transformer V2: Scaling Up Capacity and Resolution, arXiv

PaddlePaddle training/validation code and pretrained models for Swin Transformer V2.

The official PyTorch implementation is here.

This implementation is developed by PaddleViT.

Figure: Comparison of the WindowAttention module between Swin Transformer V1 and Swin Transformer V2

Update

  • Update (2021-11-27): Completed the modification of the WindowAttention module according to the original paper (a sketch of the key changes follows this list):
    • post-norm configuration
    • scaled cosine attention
    • log-spaced continuous relative position bias
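The two most distinctive changes, scaled cosine attention and the log-spaced coordinates fed to the continuous position bias MLP, can be summarized in a few lines of Paddle. The sketch below is illustrative only: the class and argument names (ScaledCosineAttention, pretrain_window) are ours, not from this repo, and the normalization constant in log_spaced_coords is an assumption; see swin.py for the actual implementation.

import math
import paddle
import paddle.nn.functional as F

class ScaledCosineAttention(paddle.nn.Layer):
    # Illustrative sketch of SwinV2's scaled cosine attention; q, k, v are
    # assumed to have shape [batch, num_heads, num_tokens, head_dim].
    def __init__(self, num_heads):
        super().__init__()
        # learnable per-head temperature, stored in log space as in the paper
        self.logit_scale = self.create_parameter(
            shape=[num_heads, 1, 1],
            default_initializer=paddle.nn.initializer.Constant(math.log(10.0)))

    def forward(self, q, k, v):
        # cosine similarity replaces the scaled dot product of SwinV1
        q = F.normalize(q, axis=-1)
        k = F.normalize(k, axis=-1)
        attn = paddle.matmul(q, k, transpose_y=True)
        # clamp the log-temperature so the learned scale stays below 100
        scale = paddle.clip(self.logit_scale, max=math.log(100.0)).exp()
        attn = F.softmax(attn * scale, axis=-1)
        return paddle.matmul(attn, v)

def log_spaced_coords(coords, pretrain_window=8):
    # log-spaced relative coordinates for the continuous position bias:
    # sign(x) * log(1 + |x|), normalized here by the pretraining window
    # size (the normalization is our assumption, not read from this repo)
    return paddle.sign(coords) * paddle.log1p(paddle.abs(coords)) / math.log(pretrain_window)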

Code modification explanation

The code modification explanation is here.

Models trained from scratch using PaddleViT

Model      | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link
swin_b_224 | -     | -     | 88.9M   | 15.3G | 224        | 0.9      | Log-CPB       | coming soon

*The results are evaluated on the ImageNet2012 validation set.
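For reference, Crop_pct=0.9 with image size 224 corresponds to the usual evaluation preprocessing: resize the short side to 224/0.9 ≈ 248, then center-crop to 224. A minimal sketch using paddle.vision.transforms follows; the bicubic interpolation and ImageNet normalization constants are standard assumptions, not values read from this repo's configs.

import paddle.vision.transforms as T

eval_transforms = T.Compose([
    T.Resize(int(224 / 0.9), interpolation='bicubic'),  # short side -> 248
    T.CenterCrop(224),
    T.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])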

Requirements

Data

The ImageNet2012 dataset is expected in the following folder structure:

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── ILSVRC2012_val_00000293.JPEG
│  ├── ILSVRC2012_val_00002138.JPEG
│  ├── ......
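A quick way to verify the layout before training (a hedged sketch; the root path follows the -data_path value used in the scripts below):

import os

root = '/dataset/imagenet'  # matches -data_path in the scripts below
print('train classes:', len(os.listdir(os.path.join(root, 'train'))))  # expect 1000
print('val images:', len(os.listdir(os.path.join(root, 'val'))))       # expect 50000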

Usage

To use the model with pretrained weights, download the .pdparams weight file and change the related file paths in the following Python scripts. The model config files are located in ./configs/.

For example, assuming the downloaded weight file is stored in ./swinv2_base_patch4_window7_224.pdparams, use the swinv2_base_patch4_window7_224 model in Python as follows:

import paddle
from config import get_config
from swin import build_swin as build_model
# config files are located in ./configs/
config = get_config('./configs/swinv2_base_patch4_window7_224.yaml')
# build model
model = build_model(config)
# load pretrained weights; the .pdparams extension must NOT be included
model_state_dict = paddle.load('./swinv2_base_patch4_window7_224')
model.set_dict(model_state_dict)
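To confirm that the weights loaded correctly, you can run a forward pass on a random batch (a hedged example; the 1000-class output shape assumes the ImageNet2012 config):

model.eval()
x = paddle.randn([1, 3, 224, 224])  # [batch, channels, height, width]
with paddle.no_grad():
    logits = model(x)
print(logits.shape)  # expect [1, 1000] for ImageNet2012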

Evaluation

To evaluate Swin Transformer V2 performance on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_eval.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/swinv2_base_patch4_window7_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./swinv2_base_patch4_window7_224'
Run evaluation using multiple GPUs:
sh run_eval_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/swinv2_base_patch4_window7_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./swinv2_base_patch4_window7_224'

Training

To train the Swin Transformer V2 model on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_train.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/swinv2_base_patch4_window7_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=32 \
    -data_path='/dataset/imagenet'
Run training using multiple GPUs:
sh run_train_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/swinv2_base_patch4_window7_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet'

Reference

@article{liu2021swin,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution},
  author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and others},
  journal={arXiv preprint arXiv:2111.09883},
  year={2021}
}