A TensorFlow implementation of Masked Autoencoders Are Scalable Vision Learners [1]. Our implementation of the proposed method is available in the mae-pretraining.ipynb notebook, which also includes evaluation with linear probing. The notebook can be fully executed on Google Colab.
Our main objective is to present the core idea of the proposed method in a minimal and readable manner. We have also prepared a blog post for getting started with Masked Autoencoders easily.
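At the heart of the method is random masking: most patch tokens are hidden, and the encoder only processes the small visible subset, which is what makes the architecture asymmetric and cheap to pre-train. The following is a minimal TensorFlow sketch of that idea, not the notebook's exact code; the function name `random_masking` and its argument names are our own:

```python
import tensorflow as tf

def random_masking(patch_embeddings, mask_proportion=0.75):
    """Keep a random subset of patch tokens; hide the rest from the encoder."""
    num_patches = patch_embeddings.shape[1]  # assumed static, e.g. (B, N, D) tokens
    num_keep = int(num_patches * (1 - mask_proportion))

    # Argsort uniform noise to get a random per-example permutation of patches.
    noise = tf.random.uniform(tf.shape(patch_embeddings)[:2])
    shuffled = tf.argsort(noise, axis=-1)
    keep_indices = shuffled[:, :num_keep]   # visible patches
    mask_indices = shuffled[:, num_keep:]   # patches the decoder must reconstruct

    # The encoder only ever sees the visible tokens.
    visible_tokens = tf.gather(patch_embeddings, keep_indices, batch_dims=1)
    return visible_tokens, keep_indices, mask_indices
```

In the full pipeline, learnable mask tokens are inserted at the masked positions before the lightweight decoder, and the reconstruction loss is computed on the masked patches only.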
With just 100 epochs of pre-training and a fairly lightweight, asymmetric autoencoder architecture, we achieve 49.33% accuracy with linear probing on the CIFAR-10 dataset. Our training logs and encoder weights are released in Weights and Logs.
For comparison, we took the same encoder architecture and trained it from scratch (refer to regular-classification.ipynb) in a fully supervised manner. This gave us ~76% test top-1 accuracy.
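For reference, linear probing means freezing the pre-trained encoder and training only a linear classifier on top of its features. A minimal Keras sketch under our own assumptions (the encoder returns a sequence of token embeddings, and we pool them with global average pooling; the function name is hypothetical):

```python
import tensorflow as tf
from tensorflow import keras

def build_linear_probe(pretrained_encoder, num_classes=10):
    # Freeze the pre-trained encoder; only the linear head is trained.
    pretrained_encoder.trainable = False

    inputs = keras.Input(shape=(32, 32, 3))  # CIFAR-10 images
    features = pretrained_encoder(inputs, training=False)   # (B, tokens, dim)
    features = keras.layers.GlobalAveragePooling1D()(features)
    outputs = keras.layers.Dense(num_classes)(features)     # linear head only

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model
```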
We note that with further hyperparameter tuning and more epochs of pre-training, we can achieve better linear-probing performance. Below we present some more results:
| Config | Masking proportion | LP performance | Encoder weights & logs |
| --- | --- | --- | --- |
| Encoder & decoder layers: 3 & 1; batch size: 256 | 0.6 | 44.25% | Link |
| Ditto | 0.75 | 46.84% | Link |
| Encoder & decoder layers: 6 & 2; batch size: 256 | 0.75 | 48.16% | Link |
| Encoder & decoder layers: 9 & 3; batch size: 256; weight decay: 1e-5 | 0.75 | 49.33% | Link |
LP denotes linear probing. Config is mostly based on what we define in the hyperparameters section of the mae-pretraining.ipynb notebook.
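To make the table's knobs concrete, here is a hypothetical config dict mirroring the quantities varied above; the actual values and names live in the hyperparameters section of mae-pretraining.ipynb:

```python
# Hypothetical config mirroring the best-performing run in the table above.
config = {
    "mask_proportion": 0.75,  # fraction of patches hidden from the encoder
    "enc_layers": 9,          # Transformer blocks in the encoder
    "dec_layers": 3,          # Transformer blocks in the (lighter) decoder
    "batch_size": 256,
    "weight_decay": 1e-5,     # used only for the 49.33% run
    "epochs": 100,            # pre-training epochs
}
```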
Thanks to:

- Xinlei Chen (one of the authors of the original paper)
- the Google Developers Experts Program and JarvisLabs for providing credits to perform extensive experimentation on A100 GPUs.
[1] Masked Autoencoders Are Scalable Vision Learners; He et al.; arXiv 2021; https://arxiv.org/abs/2111.06377.