Video Swin Transformer
By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.
This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.
Updates
06/25/2021 Initial commits
Introduction
Video Swin Transformer is initially described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches that compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
Results and Models
Kinetics 400
Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 30ep | 224 | 78.8 | 93.6 | 28M | 87.9G | config | github/baidu |
Swin-S | ImageNet-1K | 30ep | 224 | 80.6 | 94.5 | 50M | 165.9G | config | github/baidu |
Swin-B | ImageNet-1K | 30ep | 224 | 80.6 | 94.6 | 88M | 281.6G | config | github/baidu |
Swin-B | ImageNet-22K | 30ep | 224 | 82.7 | 95.5 | 88M | 281.6G | config | github/baidu |
Kinetics 600
Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|---|
Swin-B | ImageNet-22K | 30ep | 224 | 84.0 | 96.5 | 88M | 281.6G | config | github/baidu |
Something-Something V2
Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|---|
Swin-B | Kinetics 400 | 60ep | 224 | 69.6 | 92.7 | 89M | 320.6G | config | github/baidu |
Notes:
- Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
- The pre-trained model for SSv2 can be downloaded at github/baidu.
- The access code for baidu is `swin`.
Usage
Installation
Please refer to install.md for installation.
We also provide Docker files for cuda10.1 (image url) and cuda11.0 (image url) for convenient usage.
Data Preparation
Please refer to data_preparation.md for general guidance on data preparation. The supported datasets are listed in supported_datasets.md.
We also share our Kinetics-400 annotation files, k400_val and k400_train, for better comparison.
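If you build your own annotation lists in the same style, note that mmaction2's video dataset expects one sample per line, with the relative video path and the label index separated by a space. The entries below are only an illustration, not lines from the released files:
# hypothetical entries from a Kinetics-400 video annotation list
abseiling/xx0Kgtfxxxx.mp4 0
zumba/yy1Abcdyyyy.mp4 399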
Inference
# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy
# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
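For example, to evaluate the Swin-T Kinetics-400 model on a single GPU, using the config referenced in the Training section below (the checkpoint is whichever file you downloaded from the table above):
# example: single-gpu evaluation of Swin-T on Kinetics-400
python tools/test.py configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py <CHECKPOINT_FILE> --eval top_k_accuracy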
Training
To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:
bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>
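Here <PRETRAIN_MODEL> should point to an image-pretrained Swin checkpoint downloaded from the Swin Transformer classification repo; the file name below is only illustrative and may differ from the one you downloaded:
# illustrative: pass the downloaded ImageNet-1K Swin-T checkpoint as the backbone initialization
bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=./swin_tiny_patch4_window7_224.pth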
To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>
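In this case <PRETRAIN_MODEL> is a full video checkpoint (e.g. the Kinetics-400 Swin-B model from the table above) loaded via load_from rather than into the backbone alone; the path below is a placeholder for wherever you saved that checkpoint:
# illustrative: initialize the whole recognizer from a Kinetics-400 checkpoint before fine-tuning on SSv2
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=./checkpoints/swin_base_kinetics400.pth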
Note: `use_checkpoint` is used to save GPU memory. Please refer to this page for more details.
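For instance, activation checkpointing can be switched on from the command line together with the other options shown above:
# enable activation checkpointing in the backbone to reduce GPU memory usage
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL> model.backbone.use_checkpoint=True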
Apex (optional):
We use apex for mixed precision training by default. To install apex, use our provided Docker image or run:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
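A quick sanity check that apex imported correctly (not part of the original instructions, just a convenience):
# verify that apex and its amp module import cleanly
python -c "import apex; from apex import amp; print('apex OK')"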
If you would like to disable apex, comment out the following code block in the configuration files:
# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
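After commenting the block out, training simply runs in fp32 with the default optimizer hook. If your config then lacks an optimizer_config entirely, the standard mmcv-style hook below is the usual replacement (an assumption based on common mmaction2 defaults, not a line taken from this repo's configs):
# plain fp32 optimizer hook without apex mixed precision
optimizer_config = dict(grad_clip=None)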
Citation
If you find our work useful in your research, please cite:
@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}
@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
Other Links
Image Classification: See Swin Transformer for Image Classification.
Object Detection: See Swin Transformer for Object Detection.
Semantic Segmentation: See Swin Transformer for Semantic Segmentation.
Self-Supervised Learning: See MoBY with Swin Transformer.