Masked Autoencoders Are Scalable Vision Learners

A TensorFlow implementation of Masked Autoencoders Are Scalable Vision Learners [1]. Our implementation of the proposed method is available in the mae-pretraining.ipynb notebook, which also includes evaluation with linear probing and can be executed end to end on Google Colab. Our main objective is to present the core idea of the proposed method in a minimal and readable manner. We have also prepared a blog post to help you get started with Masked Autoencoders.

Source: Masked Autoencoders Are Scalable Vision Learners

With just 100 epochs of pre-training and a fairly lightweight, asymmetric autoencoder architecture, we achieve 49.33% accuracy with linear probing on the CIFAR-10 dataset. Our training logs and encoder weights are released in Weights and Logs. For comparison, we took the encoder architecture and trained it from scratch in a fully supervised manner (refer to regular-classification.ipynb), which gave us ~76% test top-1 accuracy.
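To make the term concrete: linear probing keeps the pre-trained encoder frozen and trains only a linear classifier on the features it produces. The snippet below is a minimal NumPy sketch of that idea, not the code from the notebook. The `linear_probe` and `accuracy` helpers are hypothetical names, and the synthetic 2-D clusters merely stand in for encoder features.

```python
import numpy as np


def linear_probe(train_x, train_y, num_classes, epochs=200, lr=0.1):
    """Fit a softmax classifier on frozen features with plain gradient descent."""
    n, d = train_x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[train_y]
    for _ in range(epochs):
        logits = train_x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * (train_x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


def accuracy(x, y, w, b):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(x @ w + b, axis=1) == y))


# Assumption: synthetic clusters stand in for features from a frozen encoder.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
features = centers[labels] + rng.normal(scale=0.5, size=(300, 2))

# Only w and b are trained; the features themselves are never updated.
w, b = linear_probe(features, labels, num_classes=3)
probe_accuracy = accuracy(features, labels, w, b)
print(f"linear-probe accuracy on the toy features: {probe_accuracy:.2f}")
```

Because only the linear layer is trained, the resulting accuracy reflects how linearly separable the frozen features are, which is why it is reported separately from the fully supervised baseline.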

We note that with further hyperparameter tuning and more epochs of pre-training, we can achieve better linear-probing performance. Below we present some more results:

Config	Masking proportion	LP performance	Encoder weights & logs
Encoder & decoder layers: 3 & 1; batch size: 256	0.6	44.25%	Link
Same as above	0.75	46.84%	Link
Encoder & decoder layers: 6 & 2; batch size: 256	0.75	48.16%	Link
Encoder & decoder layers: 9 & 3; batch size: 256; weight decay: 1e-5	0.75	49.33%	Link

LP denotes linear probing. The config is mostly based on what we define in the hyperparameters section of this notebook: mae-pretraining.ipynb.
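As an illustration of what the masking proportion controls, the sketch below randomly splits the patches of one image into masked and visible sets. It is a simplified NumPy sketch rather than the notebook's code: `random_patch_split` is a hypothetical helper, and the patch size of 4 on a 32x32 CIFAR-10 image (64 patches) is an assumption chosen for round numbers, so the notebook's own image and patch sizes may differ.

```python
import numpy as np


def random_patch_split(num_patches, mask_proportion, rng):
    """Return (masked_indices, visible_indices) for a single image."""
    num_masked = int(mask_proportion * num_patches)
    shuffled = rng.permutation(num_patches)  # random order of patch indices
    return shuffled[:num_masked], shuffled[num_masked:]


rng = np.random.default_rng(42)
num_patches = (32 // 4) ** 2  # assumed 4x4 patches on a 32x32 image: 64 patches

for proportion in (0.6, 0.75):
    masked, visible = random_patch_split(num_patches, proportion, rng)
    print(f"proportion={proportion}: {len(masked)} masked, {len(visible)} visible")
```

Under these assumptions, a proportion of 0.75 masks 48 of the 64 patches and leaves 16 visible, while 0.6 masks 38 and leaves 26. In the MAE design [1], the encoder processes only the visible patches, so a higher masking proportion also reduces the encoder's workload during pre-training.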

Notes

This project received the Google OSS Expert Prize (March 2022).

Acknowledgements

Xinlei Chen (one of the authors of the original paper)
Google Developers Experts Program and JarvisLabs for providing credits to perform extensive experimentation on A100 GPUs.

References

[1] Masked Autoencoders Are Scalable Vision Learners; He et al.; arXiv 2021; https://arxiv.org/abs/2111.06377.

