Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (arXiv: 2101.11986)
We will release the implementation of the Performer layer in the T2T module next week. Currently, all models that use Transformer layers in the T2T module require very high GPU memory, because the attention maps in the T2T module are large and must be kept in memory. After we release the Performer implementation, you will be able to run our T2T-ViT on GPUs with 12 GB of memory.
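The memory issue comes from materializing the full token-by-token attention matrix, which is very large for the long token sequences inside the T2T module. The sketch below is only an illustration of the idea, not the repository's actual Performer implementation: it contrasts standard softmax attention, which builds an N x N attention map, with a Performer-style linear attention that applies a positive feature map to queries and keys and never forms that map. The simple elu + 1 feature map used here is an assumption for clarity (the real Performer uses random feature maps).

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, N, dim). Materializes an (N x N) attention map,
    # which dominates memory when N is large (as in the T2T module).
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale       # (batch, heads, N, N)
    attn = attn.softmax(dim=-1)
    return attn @ v                                 # (batch, heads, N, dim)

def linear_attention(q, k, v):
    # Performer-style linear attention sketch: map q and k through a positive
    # feature map, then use associativity so the (N x N) matrix is never built.
    # Memory grows with N * dim instead of N * N.
    q = F.elu(q) + 1                                # simple positive feature map (assumption)
    k = F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                    # (batch, heads, dim, dim)
    z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)  # normalizer, (batch, heads, N, 1)
    return (q @ kv) * z                             # (batch, heads, N, dim)
```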
Our code is based on the official PyTorch ImageNet example and pytorch-image-models by Ross Wightman.
timm (install with `pip install timm`)
torch>=1.4.0
torchvision>=0.5.0
pyyaml
Model | T2T Transformer | Top-1 Acc (%) | #Params | Download
---|---|---|---|---
T2T-ViT_t-14 | Transformer | 80.7 | 21.5M | here |
T2T-ViT_t-19 | Transformer | 81.75 | 39.0M | here |
T2T-ViT_t-24 | Transformer | 82.2 | 64.1M | here |
T2T-ViT-7 | Performer | 71.2 | 4.2M | here |
T2T-ViT-10 | Performer | 74.1 | 5.8M | here |
T2T-ViT-12 | Performer | 75.5 | 6.8M | here |
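After downloading a checkpoint, you can also load it directly in Python for your own evaluation or inference code. The snippet below is a minimal sketch: it assumes the repository exposes a `T2t_vit_t_14` constructor under its `models` package (the same name used by the `--model` flag) and that the checkpoint stores weights under a `state_dict_ema` or `state_dict` key, as timm-style checkpoints usually do. Adjust the import and keys to match the actual code.

```python
import torch
from models import T2t_vit_t_14  # assumed import path; check the repo's models package

# Build the model and load a downloaded checkpoint (path is a placeholder).
model = T2t_vit_t_14()
checkpoint = torch.load('path/to/checkpoint', map_location='cpu')

# timm-style checkpoints often wrap the weights; fall back to the raw dict otherwise.
state_dict = checkpoint.get('state_dict_ema') or checkpoint.get('state_dict') or checkpoint
model.load_state_dict(state_dict)
model.eval()
```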
Test the T2T-ViT_t-14 (which uses Transformer layers in the T2T module):
Download the T2T-ViT_t-14 checkpoint, then test it by running:
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model T2t_vit_t_14 -b 100 --eval_checkpoint path/to/checkpoint
Test the T2T-ViT_t-24 (which uses Transformer layers in the T2T module):
Download the T2T-ViT_t-24 checkpoint, then test it by running:
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model T2t_vit_t_24 -b 100 --eval_checkpoint path/to/checkpoint
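If you just want to classify a single image instead of running the full validation script, the sketch below shows one way to do it. It builds and loads the model as in the checkpoint-loading sketch above (same assumptions about the import path and checkpoint keys) and uses standard ImageNet preprocessing at 224x224, which is assumed to match the `--img-size 224` setting used in this repo.

```python
import torch
from PIL import Image
from torchvision import transforms
from models import T2t_vit_t_14  # assumed import path, as in the loading sketch above

# Build and load the model (paths and checkpoint keys are assumptions).
model = T2t_vit_t_14()
ckpt = torch.load('path/to/checkpoint', map_location='cpu')
model.load_state_dict(ckpt.get('state_dict_ema') or ckpt.get('state_dict') or ckpt)
model.eval()

# Standard ImageNet preprocessing for 224x224 inputs (assumed to match training).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open('path/to/image.jpg').convert('RGB')    # any RGB image
batch = preprocess(img).unsqueeze(0)                     # shape (1, 3, 224, 224)

with torch.no_grad():
    probs = model(batch).softmax(dim=-1)

top5 = probs.topk(5)
print(top5.indices.tolist(), top5.values.tolist())       # top-5 class indices and probabilities
```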
Train the T2T-ViT_t-14 (which uses Transformer layers in the T2T module):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model T2t_vit_t_14 -b 64 --lr 5e-4 --weight-decay .05 --img-size 224
Train the T2T-ViT_t-24 (which uses Transformer layers in the T2T module):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model T2t_vit_t_24 -b 64 --lr 5e-4 --weight-decay .05 --img-size 224
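With 8 GPUs and -b 64 per GPU, the effective batch size is 8 x 64 = 512. If you train with fewer GPUs, you may need to adjust the per-GPU batch size (and possibly the learning rate) to keep a similar effective batch size. For example, an untested sketch for 4 GPUs that keeps the effective batch size at 512 by doubling -b, assuming each GPU has enough memory:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./distributed_train.sh 4 path/to/data --model T2t_vit_t_14 -b 128 --lr 5e-4 --weight-decay .05 --img-size 224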
Updating...
If you find this repo useful, please consider citing:
@misc{yuan2021tokenstotoken,
      title={Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet},
      author={Li Yuan and Yunpeng Chen and Tao Wang and Weihao Yu and Yujun Shi and Francis EH Tay and Jiashi Feng and Shuicheng Yan},
      year={2021},
      eprint={2101.11986},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}