This is the official implementation of "SimMIM: A Simple Framework for Masked Image Modeling".

By Zhenda Xie*, Zheng Zhang*, Yue Cao*, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai and Han Hu*.



SimMIM was accepted to CVPR 2022. It was also used in "Swin Transformer V2" to alleviate the data-hungry problem in large-scale vision model training.


Initial commits:

  1. Pre-trained and fine-tuned models on ImageNet-1K (Swin Base, Swin Large, and ViT Base) are provided.
  2. Code for ImageNet-1K pre-training and fine-tuning is provided.


SimMIM was initially described in an arXiv paper as a simple framework for masked image modeling. Through systematic study, we find that simple designs for each component yield very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting raw RGB pixel values by direct regression performs no worse than patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones.
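The first two design points above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the function names, the mask ratio of 0.6, the choice of an L1 (mean absolute error) loss, and the single-channel image are all assumptions made for illustration. Only the masked patch size of 32 and the 192x192 pre-training resolution come from this README.

```python
import random


def random_patch_mask(img_size, mask_patch_size, mask_ratio, seed=None):
    """Return a square grid of 0/1 flags; 1 marks a masked patch.

    Hypothetical helper: mask_ratio is an assumed hyperparameter.
    """
    if img_size % mask_patch_size != 0:
        raise ValueError("img_size must be divisible by mask_patch_size")
    grid = img_size // mask_patch_size            # patches per side
    num_patches = grid * grid
    num_masked = round(num_patches * mask_ratio)  # how many patches to hide
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if r * grid + c in masked else 0 for c in range(grid)]
            for r in range(grid)]


def masked_l1_loss(pred, target, mask, mask_patch_size):
    """Mean absolute error over pixels that lie inside masked patches.

    pred and target are 2D lists of pixel values (one channel for brevity).
    Using L1 here is an assumption; the text only says "direct regression".
    """
    total, count = 0.0, 0
    for y, row in enumerate(target):
        for x, value in enumerate(row):
            if mask[y // mask_patch_size][x // mask_patch_size]:
                total += abs(pred[y][x] - value)
                count += 1
    return total / count if count else 0.0


# A 192x192 input with 32x32 masked patches gives a 6x6 grid of patches.
mask = random_patch_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6, seed=0)

# Toy 4x4 "image" with 2x2 patches; only the top-left patch is masked.
toy_mask = [[1, 0], [0, 0]]
toy_target = [[0.0] * 4 for _ in range(4)]
toy_pred = [[2.0, 2.0, 5.0, 5.0],
            [2.0, 2.0, 5.0, 5.0],
            [5.0, 5.0, 5.0, 5.0],
            [5.0, 5.0, 5.0, 5.0]]
toy_loss = masked_l1_loss(toy_pred, toy_target, toy_mask, mask_patch_size=2)
```

In the toy example, the large errors outside the masked patch do not contribute to the loss, which is the point of restricting the regression target to masked pixels.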

Main Results on ImageNet

Swin Transformer

ImageNet-1K Pre-trained and Fine-tuned Models

| name | pre-train epochs | pre-train resolution | fine-tune resolution | acc@1 | pre-trained model | fine-tuned model |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-Base | 100 | 192x192 | 192x192 | 82.8 | google/config | google/config |
| Swin-Base | 100 | 192x192 | 224x224 | 83.5 | google/config | google/config |
| Swin-Base | 800 | 192x192 | 224x224 | 84.0 | google/config | google/config |
| Swin-Large | 800 | 192x192 | 224x224 | 85.4 | google/config | google/config |
| SwinV2-Huge | 800 | 192x192 | 224x224 | 85.7 | / | / |
| SwinV2-Huge | 800 | 192x192 | 512x512 | 87.1 | / | / |

Vision Transformer

ImageNet-1K Pre-trained and Fine-tuned Models

| name | pre-train epochs | pre-train resolution | fine-tune resolution | acc@1 | pre-trained model | fine-tuned model |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-Base | 800 | 224x224 | 224x224 | 83.8 | google/config | google/config |

Citing SimMIM

@inproceedings{xie2021simmim,
  title={SimMIM: A Simple Framework for Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
  booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Getting Started


  • Install CUDA 11.3 with cuDNN 8 by following the official CUDA and cuDNN installation guides.

  • Set up the conda environment:

# Create environment
conda create -n SimMIM python=3.8 -y
conda activate SimMIM

# Install requirements
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y

# Install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

# Clone SimMIM
git clone https://github.com/microsoft/SimMIM
cd SimMIM

# Install other requirements
pip install -r requirements.txt

Evaluating provided models

To evaluate a provided model on the ImageNet validation set, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main_finetune.py \
--eval --cfg <config-file> --resume <checkpoint> --data-path <imagenet-path>

For example, to evaluate the Swin Base model on a single GPU, run:

python -m torch.distributed.launch --nproc_per_node 1 main_finetune.py \
--eval --cfg configs/swin_base__800ep/simmim_finetune__swin_base__img224_window7__800ep.yaml --resume simmim_finetune__swin_base__img224_window7__800ep.pth --data-path <imagenet-path>
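When several checkpoints need to be evaluated, it can help to assemble the command above programmatically. The following is a small sketch that uses only the flags shown in this section; the helper name `build_eval_command` and the data path `/path/to/imagenet` are hypothetical, and the snippet only builds and prints the command rather than running it.

```python
import shlex


def build_eval_command(num_gpus, cfg, checkpoint, data_path):
    """Assemble the evaluation command from this section as an argument list.

    Hypothetical helper; it only mirrors the flags shown in the README.
    """
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "main_finetune.py",
        "--eval",
        "--cfg", cfg,
        "--resume", checkpoint,
        "--data-path", data_path,
    ]


cmd = build_eval_command(
    num_gpus=1,
    cfg="configs/swin_base__800ep/simmim_finetune__swin_base__img224_window7__800ep.yaml",
    checkpoint="simmim_finetune__swin_base__img224_window7__800ep.pth",
    data_path="/path/to/imagenet",  # placeholder, replace with your ImageNet path
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The argument list can also be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues altogether.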

Pre-training with SimMIM

To pre-train models with SimMIM, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main_simmim.py \
--cfg <config-file> --data-path <imagenet-path>/train [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <job-tag>]

For example, to pre-train Swin Base for 800 epochs on one DGX-2 server, run:

python -m torch.distributed.launch --nproc_per_node 16 main_simmim.py \
--cfg configs/swin_base__800ep/simmim_pretrain__swin_base__img192_window6__800ep.yaml --batch-size 128 --data-path <imagenet-path>/train [--output <output-directory> --tag <job-tag>]
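Because `--batch-size` is given per GPU, the global batch size for the example above follows from simple arithmetic. The sketch below works it out; the helper names are hypothetical, the ImageNet-1K training set size of 1,281,167 images is the standard figure rather than something stated in this README, and dropping the incomplete last batch is an assumption about the data loader.

```python
def global_batch_size(num_gpus, batch_size_per_gpu, num_nodes=1):
    """Total samples processed per optimizer step across all GPUs."""
    return num_nodes * num_gpus * batch_size_per_gpu


def iterations_per_epoch(dataset_size, global_batch):
    """Steps per epoch, assuming the incomplete last batch is dropped."""
    return dataset_size // global_batch


# One DGX-2 server: 16 GPUs, 128 images per GPU, as in the command above.
total = global_batch_size(num_gpus=16, batch_size_per_gpu=128)
steps = iterations_per_epoch(dataset_size=1_281_167, global_batch=total)
print(total, steps)  # 2048 625
```

When running on fewer GPUs, keeping `num_gpus * batch_size_per_gpu` constant preserves the global batch size; otherwise the learning rate in the config may need to be adjusted.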

Fine-tuning pre-trained models

To fine-tune models pre-trained by SimMIM, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main_finetune.py \
--cfg <config-file> --data-path <imagenet-path> --pretrained <pretrained-ckpt> [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <job-tag>]

For example, to fine-tune Swin Base pre-trained by SimMIM on one DGX-2 server, run:

python -m torch.distributed.launch --nproc_per_node 16 main_finetune.py \
--cfg configs/swin_base__800ep/simmim_finetune__swin_base__img224_window7__800ep.yaml --batch-size 128 --data-path <imagenet-path> --pretrained <pretrained-ckpt> [--output <output-directory> --tag <job-tag>]


