
ResFormer: Scaling ViTs with Multi-Resolution Training

Official PyTorch implementation of ResFormer: Scaling ViTs with Multi-Resolution Training, CVPR 2023 | Paper

Overview

We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance across a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions effectively, especially novel ones at test time, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes.
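
As a rough illustration of the positional embedding idea, the sketch below resizes a global learnable embedding to whatever token grid the current input resolution yields and adds a convolutional local term. The module name, base grid size, and layer choices are assumptions for illustration only; the actual implementation lives in the repo's model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalPosEmbed(nn.Module):
    """Illustrative global-local positional embedding (not the repo's exact code).

    Global part: a learnable embedding on a base grid, resized smoothly
    (bicubic interpolation) to the token grid of the current resolution.
    Local part: a depthwise convolution over the 2D token map.
    """

    def __init__(self, dim, base_grid=14):
        super().__init__()
        self.global_pos = nn.Parameter(torch.zeros(1, dim, base_grid, base_grid))
        nn.init.trunc_normal_(self.global_pos, std=0.02)
        # depthwise conv acts as a resolution-agnostic local positional bias
        self.local_pos = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, grid_size):
        # x: (B, N, C) patch tokens with N == grid_size * grid_size
        B, N, C = x.shape
        tokens = x.transpose(1, 2).reshape(B, C, grid_size, grid_size)
        # smoothly adapt the global embedding to the current token grid
        g = F.interpolate(self.global_pos, size=(grid_size, grid_size),
                          mode='bicubic', align_corners=False)
        tokens = tokens + g + self.local_pos(tokens)
        return tokens.flatten(2).transpose(1, 2)
```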

Installation

Image Classification

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install timm==0.5.4
pip install tensorboard
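
Optionally, a quick check that the pinned versions were picked up (illustrative snippet, not part of the repo):

```python
# Expected: torch 1.8.0+cu111, torchvision 0.9.0+cu111, timm 0.5.4
import torch
import torchvision
import timm

print(torch.__version__, torchvision.__version__, timm.__version__)
print("CUDA available:", torch.cuda.is_available())
```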

Scripts

Training on ImageNet-1k

The default script for training ResFormer-S-MR with training resolutions of 224, 160, and 128:

python -m torch.distributed.launch --nproc_per_node 8 main.py --data-path YOUR_DATA_PATH --model resformer_small_patch16 --output_dir YOUR_OUTPUT_PATH --batch-size 128 --pin-mem --input-size 224 160 128 --auto-resume --distillation-type 'smooth-l1' --distillation-target cls --sep-aug
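
The --distillation-type 'smooth-l1' and --distillation-target cls flags enable the scale consistency term between class tokens produced at different resolutions. A minimal sketch of such an objective, assuming the highest-resolution class token serves as a detached teacher (function and argument names are illustrative, not the repo's exact formulation):

```python
import torch
import torch.nn.functional as F

def scale_consistency_loss(cls_tokens):
    """Smooth-L1 consistency between class tokens across resolutions.

    cls_tokens: list of (B, C) class tokens, one per training resolution,
    ordered from highest to lowest resolution; requires at least two entries.
    The highest-resolution token is treated as a detached teacher.
    """
    teacher = cls_tokens[0].detach()
    losses = [F.smooth_l1_loss(student, teacher) for student in cls_tokens[1:]]
    return torch.stack(losses).mean()
```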

The default script for training ResFormer-B-MR with training resolutions of 224, 160, and 128:

python -m torch.distributed.launch --nproc_per_node 8 main.py --data-path YOUR_DATA_PATH --model resformer_base_patch16 --output_dir YOUR_OUTPUT_PATH --batch-size 128 --pin-mem --input-size 224 160 128 --auto-resume --distillation-type 'smooth-l1' --distillation-target cls --sep-aug --epochs 200 --drop-path 0.2 --lr 8e-4 --warmup-epochs 20 --clip-grad 5.0 --cooldown-epochs 0
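
For context on --input-size 224 160 128: each batch is replicated at the listed resolutions before being fed to the model. A simplified sketch of that replication step is shown below; the repo's loader additionally applies separate augmentations per resolution when --sep-aug is set, which this sketch does not reproduce.

```python
import torch.nn.functional as F

def make_multires_batch(images, sizes=(224, 160, 128)):
    """Replicate a batch at several training resolutions (illustrative only).

    images: (B, 3, H, W) tensor from the loader with H == W == max(sizes).
    Returns one tensor per resolution; each replica is forwarded through the
    model, and the per-resolution losses plus the scale consistency term
    are combined.
    """
    return [
        images if s == images.shape[-1]
        else F.interpolate(images, size=(s, s), mode='bicubic', align_corners=False)
        for s in sizes
    ]
```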

Model Zoo

Image Classification on ImageNet-1k

Top-1 accuracy (%) at each test resolution.

| Name | Training Res. | Top-1 @96 | Top-1 @128 | Top-1 @160 | Top-1 @224 | Top-1 @384 | Top-1 @512 | Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResFormer-T-MR | 128, 160, 224 | 61.40 | 67.78 | 71.09 | 73.85 | 75.04 | 73.77 | google |
| ResFormer-S-MR | 128, 160, 224 | 73.59 | 78.24 | 80.39 | 82.16 | 82.72 | 82.00 | google |
| ResFormer-S-MR | 128, 224, 384 | 72.92 | 77.84 | 80.09 | 82.28 | 83.70 | 83.86 | google |
| ResFormer-B-MR | 128, 160, 224 | 75.86 | 79.74 | 81.52 | 82.72 | 83.29 | 82.63 | google |
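
Columns such as Top-1 @96, @384, and @512 correspond to test resolutions outside the training set. A hedged sketch of evaluating a loaded model at one such resolution is given below; model construction and checkpoint loading are omitted, the eval transform is an assumption, and main.py --eval remains the supported evaluation path.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

def evaluate_at_resolution(model, val_dir, resolution=384, device="cuda"):
    """Top-1 accuracy at an arbitrary (possibly unseen) test resolution."""
    tfm = transforms.Compose([
        transforms.Resize(int(resolution * 256 / 224)),  # keep the usual crop ratio
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
    loader = DataLoader(ImageFolder(val_dir, tfm), batch_size=64, num_workers=8)
    model = model.to(device).eval()
    correct = total = 0
    with torch.no_grad():
        for images, targets in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return 100.0 * correct / total
```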

Catalog

  • image classification
  • object detection
  • semantic segmentation
  • action recognition

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

@inproceedings{tian2022resformer,
  title={ResFormer: Scaling ViTs with Multi-Resolution Training},
  author={Tian, Rui and Wu, Zuxuan and Dai, Qi and Hu, Han and Qiao, Yu and Jiang, Yu-Gang},
  booktitle={CVPR},
  year={2023}
}