Yossi Gandelsman*, Yu Sun*, Xinlei Chen and Alexei A. Efros
We provide an environment.yml file that can be used to create a Conda environment:
conda env create -f environment.yml
conda activate ttt
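After activating the environment, you can run a quick sanity check (our suggestion, not part of the original setup) to confirm that PyTorch is installed and can see your GPUs:

# Expect a version string, followed by True if CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"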
To train a model on the main task, please use the code base from Masked Autoencoders Are Scalable Vision Learners. We also provide self-contained training code in this repository. Please run:
TIME=$(date +%s%3N)
DATA_PATH='...'
OUTPUT_DIR='...'
python -m torch.distributed.launch --nproc_per_node=8 main_pretrain.py \
--data_path ${DATA_PATH} \
--model mae_vit_large_patch16 \
--input_size 224 \
--batch_size 64 \
--mask_ratio 0.75 \
--warmup_epochs 40 \
--epochs 800 \
--blr 1e-3 \
--save_ckpt_freq 100 \
--output_dir ${OUTPUT_DIR} \
--dist_url "file://$OUTPUT_DIR/$TIME"
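Here --data_path should point to ImageNet in the standard ImageFolder layout that the MAE code base expects; an illustrative sketch (class and file names are examples):

${DATA_PATH}/
  train/
    n01440764/
      n01440764_10026.JPEG
      ...
    n01443537/
      ...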
Alternatively, you can use a pretrained ViT-Large model from here:
mkdir checkpoints
cd checkpoints
wget https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_large_full.pth
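Optionally, check that the download completed and the checkpoint loads cleanly (a convenience check, not part of the original instructions; run it from inside the checkpoints directory):

python -c "import torch; print(list(torch.load('mae_pretrain_vit_large_full.pth', map_location='cpu').keys()))"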
To train the classification head, run this:
TIME=$(date +%s%3N)
DATA_PATH='...'
OUTPUT_DIR='...'
RESUME_MODEL='checkpoints/mae_pretrain_vit_large_full.pth'
python -m torch.distributed.launch --nproc_per_node=8 main_prob.py \
--batch_size 32 \
--accum_iter 4 \
--model mae_vit_large_patch16 \
--finetune ${RESUME_MODEL} \
--epochs 20 \
--input_size 224 \
--head_type vit_head \
--blr 1e-3 \
--norm_pix_loss \
--weight_decay 0.2 \
--dist_eval --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR}
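Note that with 8 processes, a per-GPU batch size of 32, and --accum_iter 4, the effective batch size is 32 × 8 × 4 = 1024; following the learning-rate scaling rule in the MAE code base, the actual learning rate becomes blr × 1024 / 256 = 4e-3.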
Alternatively, you can use a pretrained model (with slightly different parameters) from here:
mkdir checkpoints
cd checkpoints
wget https://dl.fbaipublicfiles.com/mae/ttt/prob_lr1e-3_wd.2_blk12_ep20.pth
To run test-time training, you will first need to download the ImageNet-C dataset from here.
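ImageNet-C is organized by corruption type and severity level, with ImageFolder-style class directories at the leaves; after extraction the tree should look roughly like this (illustrative sketch):

path_to_imagenet-c/
  gaussian_noise/
    1/
    ...
    5/
      n01440764/
        ...
  shot_noise/
  ...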
After extracting the dataset, you can run test-time training on each of the test sets:
TIME=$(date +%s%3N)
DATA_PATH_BASE='path_to_imagenet-c'
DATASET='gaussian_noise'
LEVEL='5'
RESUME_MODEL='checkpoints/mae_pretrain_vit_large_full.pth'
RESUME_FINETUNE='checkpoints/prob_lr1e-3_wd.2_blk12_ep20.pth'
OUTPUT_DIR_BASE='...'
python main_test_time_training.py \
--data_path "$DATA_PATH_BASE/$DATASET/$LEVEL" \
--model mae_vit_large_patch16 \
--input_size 224 \
--batch_size 128 \
--steps_per_example 20 \
--mask_ratio 0.75 \
--blr 1e-2 \
--norm_pix_loss \
--optimizer_type 'sgd' \
--classifier_depth 12 \
--head_type "vit_head" \
--single_crop \
--dataset_name "imagenet_c" \
--output_dir "$OUTPUT_DIR_BASE/$DATASET/" \
--dist_url "file://$OUTPUT_DIR_BASE/$TIME" \
--finetune_mode 'encoder' \
--resume_model ${RESUME_MODEL} \
--resume_finetune ${RESUME_FINETUNE}
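To sweep over all corruptions and severity levels, you can wrap the command above in a shell loop. A minimal sketch, assuming the 15 standard ImageNet-C corruption names; the per-level output directory is our suggested organization, not something the script requires:

DATA_PATH_BASE='path_to_imagenet-c'
OUTPUT_DIR_BASE='...'
RESUME_MODEL='checkpoints/mae_pretrain_vit_large_full.pth'
RESUME_FINETUNE='checkpoints/prob_lr1e-3_wd.2_blk12_ep20.pth'
# Loop over the 15 standard ImageNet-C corruptions and severity levels 1-5
for DATASET in gaussian_noise shot_noise impulse_noise defocus_blur glass_blur \
    motion_blur zoom_blur snow frost fog brightness contrast \
    elastic_transform pixelate jpeg_compression; do
  for LEVEL in 1 2 3 4 5; do
    TIME=$(date +%s%3N)
    mkdir -p "$OUTPUT_DIR_BASE/$DATASET/$LEVEL"
    python main_test_time_training.py \
      --data_path "$DATA_PATH_BASE/$DATASET/$LEVEL" \
      --model mae_vit_large_patch16 \
      --input_size 224 \
      --batch_size 128 \
      --steps_per_example 20 \
      --mask_ratio 0.75 \
      --blr 1e-2 \
      --norm_pix_loss \
      --optimizer_type 'sgd' \
      --classifier_depth 12 \
      --head_type "vit_head" \
      --single_crop \
      --dataset_name "imagenet_c" \
      --output_dir "$OUTPUT_DIR_BASE/$DATASET/$LEVEL/" \
      --dist_url "file://$OUTPUT_DIR_BASE/$TIME" \
      --finetune_mode 'encoder' \
      --resume_model ${RESUME_MODEL} \
      --resume_finetune ${RESUME_FINETUNE}
  done
done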
To evaluate the model without applying test-time training, run:
DATA_PATH_BASE='path_to_imagenet-c'
DATASET='gaussian_noise'
LEVEL='5'
RESUME_MODEL='checkpoints/mae_pretrain_vit_large_full.pth'
RESUME_FINETUNE='checkpoints/prob_lr1e-3_wd.2_blk12_ep20.pth'
OUTPUT_DIR_BASE='...'
python test_without_adaptation.py \
--data_path "$DATA_PATH_BASE/$DATASET/$LEVEL" \
--model mae_vit_large_patch16 \
--input_size 224 \
--resume_model ${RESUME_MODEL} \
--resume_finetune ${RESUME_FINETUNE} \
--output_dir "$OUTPUT_DIR_BASE/$DATASET/baseline" \
--classifier_depth 12 \
--head_type "vit_head"
If you find this work useful, please cite our paper:
@inproceedings{gandelsman2022testtime,
title={Test-Time Training with Masked Autoencoders},
author={Yossi Gandelsman and Yu Sun and Xinlei Chen and Alexei A Efros},
booktitle={Advances in Neural Information Processing Systems},
editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
year={2022},
url={https://openreview.net/forum?id=SHMi1b7sjXk}
}