[ICLR2024] Exploring Target Representations for Masked Autoencoders

Masked Autoencoding with dBOT

[arXiv] [BibTex]

This is the official PyTorch implementation of Exploring Target Representations for Masked Autoencoders.

News 🎉

  • January 2024 - The paper was accepted to ICLR 2024.
  • November 2022 - Release the code and pre-trained models.
  • September 2022 - Release the pre-print on arXiv.

Installation

For installation and preparation, please follow MAE and iBOT. This repo is built upon python==3.6, timm==0.4.12, and pytorch==1.9.0.

Pre-Training

See the pre-training instructions for details.

Downstream Tasks

See the downstream instructions for details.

Pre-Trained and Fine-Tuned Models

We provide the pre-trained model (pt. model) and the fine-tuned model (ft. model) of dBOT for each experimental setup. The pre-trained models can be downloaded for use in downstream tasks. In the asym. enc-dec column, ✓ denotes that the decoder is appended after the encoder, with a fixed delayed mask and sin-cos position embedding; ✘ denotes that a vanilla ViT is used, without the delayed mask and with relative position embedding.

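| Arch.    | Teacher   | asym. enc-dec | cls.  | det. | seg. | download                     |
|----------|-----------|---------------|-------|------|------|------------------------------|
| ViT-B    | ViT-B     | ✓             | 84.5% | 52.7 | 49.5 | pt. model, ft. model, pt. log |
| ViT-B    | ViT-L     | ✓             | 84.6% | 53.1 | 50.1 | pt. model, ft. model, pt. log |
| ViT-B    | ViT-H     | ✓             | 84.6% | 53.5 | 50.8 | pt. model, ft. model, pt. log |
| ViT-B    | CLIP-B/16 | ✘             | 85.7% | 53.6 | 52.9 | pt. model, ft. model, pt. log |
| ViT-L    | ViT-L     | ✓             | 86.6% | 56.0 | 54.5 | pt. model, ft. model, pt. log |
| ViT-L    | ViT-H     | ✓             | 86.8% | 56.1 | 55.2 | pt. model, ft. model, pt. log |
| ViT-L    | CLIP-L/14 | ✘             | 87.8% | 56.8 | 56.2 | pt. model, ft. model, pt. log |
| ViT-H    | ViT-H     | ✓             | 87.4% | -    | -    | pt. model, ft. model, pt. log |
| ViT-H    | CLIP-L/14 | ✘             | 88.5% | -    | -    | pt. model, ft. model, pt. log |
| ViT-H448 | ViT-H     | ✓             | 88.0% | -    | -    | pt. model, ft. model, pt. log |
| ViT-H448 | CLIP-L/14 | ✘             | 89.1% | -    | -    | pt. model, ft. model, pt. log |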

🎯 This branch implements dBOT with the default asymmetric encoder-decoder architecture. For the symmetric architecture, which we use with CLIP as the pre-trained teacher, please see the beit branch.
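For readers unfamiliar with the sin-cos position embedding mentioned for the asymmetric encoder-decoder setup, the following is a minimal NumPy sketch of a fixed 2D sine-cosine embedding over a square patch grid. It is an illustration under stated assumptions, not code taken from this repository: the function names, the temperature of 10000, and the order in which the row and column halves are concatenated are all hypothetical choices.

```python
import numpy as np


def sincos_pos_embed_1d(embed_dim, positions):
    """Fixed 1D sine-cosine embedding (hypothetical helper, not the repo's API).

    Returns an array of shape (len(positions), embed_dim): the first half of
    the channels holds sines, the second half holds cosines.
    """
    assert embed_dim % 2 == 0, "embed_dim must be even"
    # Geometric frequency schedule; the base 10000 is an assumed convention.
    omega = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.0)
    omega = 1.0 / (10000.0 ** omega)
    angles = np.outer(np.asarray(positions, dtype=np.float64), omega)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def sincos_pos_embed_2d(embed_dim, grid_size):
    """Fixed 2D embedding for a grid_size x grid_size patch grid.

    Half of the channels encode the row index and half the column index
    (this split and its ordering are assumptions made for illustration).
    """
    assert embed_dim % 4 == 0, "embed_dim must be divisible by 4"
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    emb_y = sincos_pos_embed_1d(embed_dim // 2, ys.ravel())
    emb_x = sincos_pos_embed_1d(embed_dim // 2, xs.ravel())
    return np.concatenate([emb_y, emb_x], axis=1)


# Example: a ViT-B-sized embedding for a 14 x 14 grid of patches.
pos = sincos_pos_embed_2d(768, 14)
print(pos.shape)  # (196, 768)
```

Because the embedding is a fixed function of the patch position, it has no learnable parameters and can be recomputed for a different grid size.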

Property Analysis

To show how the models differ in their weights and outputs, we conduct a property analysis using averaged attention distance and singular value decomposition. We first compute the averaged attention distance for each attention head in the different Transformer blocks. The results are averaged over the IN1K validation set.
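As a rough illustration of what an averaged attention distance measures, the sketch below weights the spatial distance between every pair of patches by the attention one patch pays to the other, then averages over query patches to obtain one value per head. This is a sketch under assumptions, not the analysis script used for the results here: the function name is hypothetical, the class token is assumed to have been removed, and distances are measured in pixels between patch positions.

```python
import numpy as np


def mean_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted mean distance per head (hypothetical helper).

    attn: array of shape (heads, N, N) with N == grid_size ** 2, where each
          row sums to 1 and indexes patch tokens only (no class token).
    Returns an array of shape (heads,), in pixels.
    """
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)  # (N, 2)
    # Pairwise Euclidean distance between patch positions, scaled to pixels.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) * patch_size  # (N, N)
    # Expected distance for each query token, then the mean over queries.
    per_query = (attn * dist[None, :, :]).sum(axis=-1)  # (heads, N)
    return per_query.mean(axis=-1)


# Toy inputs on a 2 x 2 grid: purely local vs. fully uniform attention.
heads, g = 2, 2
local_attn = np.tile(np.eye(g * g), (heads, 1, 1))
uniform_attn = np.full((heads, g * g, g * g), 1.0 / (g * g))
print(mean_attention_distance(local_attn, g))    # zero for every head
print(mean_attention_distance(uniform_attn, g))  # larger: attention is spread out
```

Averaging these per-head values over a validation set gives one number per head and block; small values indicate local attention and large values indicate global attention.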

We also compute the percentage of the top-k singular values (with k varying from 1 to 5) of the embedding at each layer.
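One plausible way to compute such a percentage is sketched below: take the singular values of a layer's token embedding matrix and divide the sum of the k largest by the sum of all of them. Whether the original analysis normalizes plain or squared singular values is not stated here, so the plain sum is an assumption, and the function name is hypothetical.

```python
import numpy as np


def topk_singular_value_ratio(embedding, k):
    """Share of the singular-value mass held by the k largest values.

    embedding: array of shape (num_tokens, dim), e.g. one layer's output.
    Normalizing by the plain sum of singular values is an assumption.
    """
    # np.linalg.svd returns singular values sorted in descending order.
    s = np.linalg.svd(embedding, compute_uv=False)
    return s[:k].sum() / s.sum()


# Toy check with known singular values 4, 3, 2, 1.
toy = np.diag([4.0, 3.0, 2.0, 1.0])
for k in range(1, 6):
    print(k, topk_singular_value_ratio(toy, k))
```

A ratio close to 1 for small k means the embedding is dominated by a few directions, while a lower ratio means the representation spreads across more dimensions.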

The student networks distilled from differently initialized teachers exhibit similar behaviors, which indicates that the choice of teacher network does not matter when teachers are bootstrapped.

Acknowledgement

This repository is modified from the MAE repository and the iBOT repository.

License

This project is released under the Apache 2.0 license, as found in the LICENSE file.

Citing dBOT

Please consider citing dBOT and giving a star if dBOT helps your research:

@article{liu2022exploring,
  title={Exploring target representations for masked autoencoders},
  author={Liu, Xingbin and Zhou, Jinghao and Kong, Tao and Lin, Xianming and Ji, Rongrong},
  journal={arXiv preprint arXiv:2209.03917},
  year={2022}
}