A Closer Look at Self-Supervised Lightweight Vision Transformers
Shaoru Wang, Jin Gao*, Zeming Li, Xiaoqin Zhang, Weiming Hu
ICML 2023
- 2023.5: Code & models are released!
- 2023.4: Our paper is accepted by ICML 2023!
- 2022.5: Our initial version of the paper was published on arXiv.
MAE-Lite focuses on exploring the pre-training of lightweight Vision Transformers (ViTs). This repo provides the code and models for the study in the paper.
- We provide advanced pre-training (based on MAE) and fine-tuning recipes for lightweight ViTs and demonstrate that even a vanilla lightweight ViT (e.g., ViT-Tiny) beats most previous SOTA ConvNets and ViT derivatives with delicate network architecture design. We achieve 79.0% top-1 accuracy on ImageNet with vanilla ViT-Tiny (5.7M parameters).
- We provide code for the transfer evaluation of pre-trained models on several classification tasks (e.g., Oxford 102 Flower, Oxford-IIIT Pet, FGVC Aircraft, CIFAR, etc.) and COCO detection tasks (based on ViTDet). We find that the self-supervised pre-trained ViTs work worse than the supervised pre-trained ones on data-insufficient downstream tasks.
- We provide code for the analysis tools used in the paper to examine the layer representations and attention distance & entropy for the ViTs.
- We provide code and models for our proposed knowledge distillation method for the pre-trained lightweight ViTs based on MAE, which shows superiority in the transfer evaluation on data-insufficient classification tasks and dense prediction tasks.
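As a rough illustration of the attention metrics mentioned in the list above, the sketch below computes the entropy and the mean attention distance for a single query token attending over a square grid of patches. The function names, the row-major patch layout, and the pixel-based distance are assumptions made for this example; the analysis tools in this repo may define and aggregate these quantities differently.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)


def attention_distance(weights, query_index, grid_size, patch_size=16):
    """Attention-weighted mean pixel distance from the query patch to every patch.

    Assumes patches are laid out row by row on a grid_size x grid_size grid.
    """
    qy, qx = divmod(query_index, grid_size)
    total = 0.0
    for key_index, p in enumerate(weights):
        ky, kx = divmod(key_index, grid_size)
        total += p * math.hypot(ky - qy, kx - qx) * patch_size
    return total


# Toy example: a 2x2 patch grid, query patch 0, uniform attention.
uniform = [0.25, 0.25, 0.25, 0.25]
print(attention_entropy(uniform))         # equals log(4) for a uniform distribution
print(attention_distance(uniform, 0, 2))  # mean distance in pixels
```

Low entropy together with a short distance would indicate attention focused on nearby patches, while high values of both suggest broad, global attention.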
Setup conda environment:
# Create environment
conda create -n mae-lite python=3.7 -y
conda activate mae-lite
# Install requirements
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch -y
# Clone MAE-Lite
git clone https://github.com/wangsr126/mae-lite.git
cd mae-lite
# Install other requirements
pip3 install -r requirements.txt
python3 setup.py build develop --user
Prepare the ImageNet data in <BASE_FOLDER>/data/imagenet/imagenet_train and <BASE_FOLDER>/data/imagenet/imagenet_val.
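Before launching a long run, it may help to verify that both split folders exist. The helper below is a minimal sketch and not part of this repo: `check_imagenet_layout` is a hypothetical name, and it assumes an ImageFolder-style layout with one sub-folder per class, which is not stated above.

```python
from pathlib import Path


def check_imagenet_layout(base_folder):
    """Return {split: number of class sub-folders}; raise if a split is missing."""
    root = Path(base_folder) / "data" / "imagenet"
    counts = {}
    for split in ("imagenet_train", "imagenet_val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing ImageNet split: {split_dir}")
        # Assumption: one sub-folder per class (ImageFolder-style layout).
        counts[split] = sum(1 for entry in split_dir.iterdir() if entry.is_dir())
    return counts
```

Calling `check_imagenet_layout("<BASE_FOLDER>")` with your actual base folder returns the number of class folders found in each split. For the full ImageNet-1k dataset, 1000 per split would be expected under the layout assumed here.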
To pre-train ViT-Tiny with our recommended MAE recipe:
# Batch size 4096 on 8 GPUs:
cd projects/mae_lite
ssl_train -b 4096 -d 0-7 -e 400 -f mae_lite_exp.py --amp \
--exp-options exp_name=mae_lite/mae_tiny_400e
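When adapting the command to a different number of GPUs, the per-GPU batch size is the quantity that has to fit in memory. The sketch below works it out under two assumptions that the command's comment suggests but does not confirm: `-b` is the total batch size across all GPUs, and `-d 0-7` selects GPU ids 0 through 7 inclusive. The helper names and the comma-separated form are hypothetical and not part of the `ssl_train` CLI.

```python
def parse_devices(spec):
    """Expand a device spec such as "0-7" (or, hypothetically, "0,2,3") into GPU ids."""
    devices = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            devices.extend(range(int(start), int(end) + 1))
        else:
            devices.append(int(part))
    return devices


def per_gpu_batch_size(total_batch_size, device_spec):
    """Split a total batch size evenly across the selected GPUs."""
    num_gpus = len(parse_devices(device_spec))
    if total_batch_size % num_gpus != 0:
        raise ValueError(f"{total_batch_size} is not divisible by {num_gpus} GPUs")
    return total_batch_size // num_gpus


print(per_gpu_batch_size(4096, "0-7"))  # 512
```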
Please download the pre-trained models, e.g.,
download MAE-Tiny to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar
To fine-tune with the improved recipe:
# Batch size 1024 on 8 GPUs:
cd projects/eval_tools
ssl_train -b 1024 -d 0-7 -e 300 -f finetuning_exp.py --amp \
[--ckpt <checkpoint-path>] --exp-options pretrain_exp_name=mae_lite/mae_tiny_400e
<checkpoint-path>: if set to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar, it will be loaded as initialization; if not set, the checkpoint at <BASE_FOLDER>/outputs/mae_lite/mae_tiny_400e/last_epoch_ckpt.pth.tar will be loaded automatically.
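The fallback rule for the checkpoint can be sketched as follows. This is only an illustration of the behaviour described above, not the repo's actual implementation; `resolve_checkpoint` and its parameters are hypothetical names.

```python
def resolve_checkpoint(base_folder, pretrain_exp_name, ckpt=None):
    """Pick the checkpoint used to initialize fine-tuning."""
    if ckpt is not None:
        # An explicit --ckpt path takes priority.
        return ckpt
    # Otherwise fall back to the last checkpoint of the pre-training run.
    return f"{base_folder}/outputs/{pretrain_exp_name}/last_epoch_ckpt.pth.tar"


print(resolve_checkpoint("<BASE_FOLDER>", "mae_lite/mae_tiny_400e"))
# <BASE_FOLDER>/outputs/mae_lite/mae_tiny_400e/last_epoch_ckpt.pth.tar
```

In other words, `--ckpt` is only needed when fine-tuning from a downloaded model rather than from a pre-training run stored under the outputs folder.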
To evaluate the fine-tuned model, download MAE-Tiny-FT to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar
# Batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_eval
If everything is set up correctly, you should get "Top1: 77.978".
To evaluate the model fine-tuned with RPE, download MAE-Tiny-FT-RPE to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar
# Batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval
If everything is set up correctly, you should get "Top1: 79.002".
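For reference, the reported "Top1" number is the percentage of validation images whose highest-scoring class matches the label. The snippet below is a minimal sketch of that metric on toy data; it is not the code in `mae_lite/tools/eval.py`, and the function name is hypothetical.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example: 4 samples, 2 classes, 3 of them classified correctly.
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [1, 0, 0, 0]
print(f"Top1: {top1_accuracy(logits, labels):.3f}")  # Top1: 75.000
```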
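- Pre-training with distillation: please refer to DISTILL.md.
- Transfer evaluation on other classification datasets: please refer to TRANSFER.md.
- Transfer evaluation on COCO detection: please refer to DETECTION.md.
- MoCo-v3 experiments: please refer to MOCOV3.md.
- Analysis and visualization tools: please refer to VISUAL.md.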
pre-train code | pre-train epochs | fine-tune recipe | fine-tune epochs | top-1 accuracy (%) | ckpt |
---|---|---|---|---|---|
- | - | impr. | 300 | 75.8 | link |
mae_lite | 400 | - | - | - | link |
mae_lite | 400 | impr. | 300 | 78.0 | link |
mae_lite | 400 | impr.+RPE | 1000 | 79.0 | link |
mae_lite_distill | 400 | - | - | - | link |
mae_lite_distill | 400 | impr. | 300 | 78.4 | link |
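To make the effect of pre-training easier to read off, the short sketch below computes the accuracy gain of each fine-tuned model over the baseline without pre-training. The numbers are copied from the table above, reading the rows without a pre-train entry as belonging to the pre-training run listed just before them; the dictionary layout is only an illustration.

```python
# (pre-train code, fine-tune recipe, fine-tune epochs) -> top-1 accuracy (%)
results = {
    ("-", "impr.", 300): 75.8,
    ("mae_lite", "impr.", 300): 78.0,
    ("mae_lite", "impr.+RPE", 1000): 79.0,
    ("mae_lite_distill", "impr.", 300): 78.4,
}

baseline = results[("-", "impr.", 300)]
gains = {
    key: round(accuracy - baseline, 1)
    for key, accuracy in results.items()
    if key[0] != "-"
}

for (pretrain, recipe, epochs), gain in gains.items():
    print(f"{pretrain} + {recipe} ({epochs}e): +{gain} points")
```

Note that the impr.+RPE row uses a different recipe and a longer schedule than the baseline, so its gain is not a like-for-like comparison.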
Please cite the following paper if this repo helps your research:
@article{wang2023closer,
title={A Closer Look at Self-Supervised Lightweight Vision Transformers},
author={Shaoru Wang and Jin Gao and Zeming Li and Xiaoqin Zhang and Weiming Hu},
journal={arXiv preprint arXiv:2205.14443},
year={2023},
}
We thank the authors of timm, MAE, and MoCo-v3 for their code implementations.
This repo is released under the Apache 2.0 license. Please see the LICENSE file for more information.