iFormer: Inception Transformer (NeurIPS 2022 Oral)

This is a PyTorch implementation of iFormer proposed by our paper "Inception Transformer".

Image Classification

1. Requirements

torch>=1.7.0; torchvision>=0.8.1; timm==0.5.4; fvcore; apex-amp (if you want to use fp16);

data prepare: ImageNet with the following folder structure, you can extract ImageNet by this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

Main results on ImageNet-1K

Model	#params	FLOPs	Image resolution	acc@1	Model
iFormer-S	20M	4.8G	224	83.4	model/config/log
iFormer-B	48M	9.4G	224	84.6	model/config/log
iFormer-L	87M	14.0G	224	84.8	model/config/log

Fine-tuning Results with larger resolution (384x384) on ImageNet-1K

Model	#params	FLOPs	Image resolution	acc@1	Model
iFormer-S	20M	16.1G	384	84.6	model/config/log
iFormer-B	48M	30.5G	384	85.7	model/config/log
iFormer-L	87M	45.3G	384	85.8	model/config/log

Training

Train iformer_small on 224

python -m torch.distributed.launch --nproc_per_node=8 train.py /dataset/imagenet \
--model iformer_small -b 128 --epochs 300 --img-size 224 --drop-path 0.2 --lr 1e-3 \
--weight-decay 0.05 --aa rand-m9-mstd0.5-inc1 --warmup-lr 1e-6 --warmup-epochs 5 \
--output checkpoint --min-lr 1e-6 --experiment iformer_small

Finetune on 384 based on the pretrained checkpoint on 224

python -m torch.distributed.launch --nproc_per_node=8 fine-tune.py /dataset/imagenet \
--model iformer_small_384 -b 64 --lr 1e-5 --min-lr 1e-6 --warmup-lr 2e-8 --warmup-epochs 0 \
--epochs 20 --img-size 384 --drop-path 0.3 --weight-decay 1e-8 --mixup 0.1 --cutmix 0.1 \
--cooldown-epochs 10 --aa rand-m9-mstd0.5-inc1 --clip-grad 1.0 --output checkpoint_fine \
--initial-checkpoint checkpoint/iformer_small/model_best.pth.tar \
--experiment iformer_small_384

Validation

python validate.py /dataset/imagenet --model iformer_small  --checkpoint checkpoint/iformer_small/model_best.pth.tar

Object Detection and Instance Segmentation

All models are based on Mask R-CNN and trained by 1x training schedule.

Backbone	#Param.	FLOPs	box mAP	mask mAP
iFormer-S	40M	263G	46.2	41.9
iFormer-B	67M	351G	48.3	43.3

Semantic Segmentation

Backbone	Method	#Param.	FLOPs	mIoU
iFormer-S	FPN	24M	181G	48.6
iFormer-S	Upernet	49M	938G	48.4

Bibtex

@inproceedings{
si2022inception,
title={Inception Transformer},
author={Chenyang Si and Weihao Yu and Pan Zhou and Yichen Zhou and Xinchao Wang and Shuicheng YAN},
booktitle={Advances in Neural Information Processing Systems},
year={2022}
}

Acknowledgment

Our implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful works.

pytorch-image-models, mmdetection, mmsegmentation.

Besides, Weihao Yu would like to thank TPU Research Cloud (TRC) program for the support of partial computational resources.

sail-sg/iFormer