
CatMAE

Primary language: Python · License: Apache-2.0

Concatenated Masked Autoencoders as Spatial-Temporal Learner: A PyTorch Implementation

This is a PyTorch re-implementation of the paper Concatenated Masked Autoencoders as Spatial-Temporal Learner.

Requirements

Data Preparation

We use two datasets in total: Kinetics-400 for pre-training and action recognition, and DAVIS-2017 for video segmentation.

  • Kinetics-400 used in our experiment comes from here.
  • DAVIS-2017 used in our experiment comes from here.

Pre-training

Arguments set in the config_file take precedence.

To pre-train CatMAE-ViT-Small, run the following command:

python main_pretrain.py --config_file configs/pretrain_catmae_vit-s-16.json
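The authoritative schema of the config file is whatever ships in configs/; purely as an illustration, a sketch of assembling a config that mirrors the argument names described below is shown here. All key names, the output filename, and the blr value are assumptions, not the repository defaults.

import json

# Illustrative only: key names are assumed to mirror the argument names
# discussed under "Some important arguments"; consult
# configs/pretrain_catmae_vit-s-16.json for the real schema.
config = {
    "data_path": "/path/to/Kinetics-400/videos_train/",
    "model": "catmae_vit_small",
    "batch_size": 256,        # per-GPU batch size
    "accum_iter": 2,          # gradient accumulation steps
    "epochs": 150,            # with repeated_sampling=2 this gives 300 effective epochs
    "repeated_sampling": 2,
    "norm_pix_loss": True,    # use normalized pixels as the reconstruction target
    "blr": 1.5e-4,            # placeholder base learning rate (assumption)
}

with open("configs/my_pretrain_config.json", "w") as f:  # hypothetical filename
    json.dump(config, f, indent=2)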

Some important arguments

  • The data_path is /path/to/Kinetics-400/videos_train/
  • The effective batch size is batch_size (256) * num of gpus (4) * accum_iter (2) = 2048
  • The effective epochs is epochs (150) * repeated_sampling (2) = 300
  • The default model is catmae_vit_small (with the default patch_size and decoder_dim_dep_head); to train ViT-B, you can also change it to catmae_vit_base.
  • Here we use --norm_pix_loss (normalized pixels as the reconstruction target) for better representation learning.
  • blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective_batch_size / 256 (a worked example follows this list).
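As a quick sanity check, the batch-size, epoch, and learning-rate arithmetic above can be written out explicitly. The blr value below is a placeholder, not the repository default.

# Worked example of the scaling rules described above.
batch_size = 256      # per-GPU batch size
num_gpus = 4
accum_iter = 2
effective_batch_size = batch_size * num_gpus * accum_iter   # 2048

epochs = 150
repeated_sampling = 2
effective_epochs = epochs * repeated_sampling                # 300

blr = 1.5e-4          # placeholder base learning rate (assumption)
lr = blr * effective_batch_size / 256                        # linear scaling rule -> 1.2e-3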

Pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper:

                          ViT/16-Small    ViT/8-Small
pre-trained checkpoint    download        download
DAVIS 2017 (J&F mean)     62.5            70.4
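A minimal sketch of loading one of these checkpoints for downstream use is shown below. The module name models_catmae, the checkpoint filename, and the "model" key are assumptions based on the MAE-style code layout, not confirmed by this README.

import torch
import models_catmae  # assumed module name; adjust to the repository's actual file

# Build the architecture matching the checkpoint (ViT/16-Small here).
model = models_catmae.catmae_vit_small()

# Load the downloaded pre-trained weights; the "model" key follows the
# MAE-style checkpoint convention and is an assumption here.
checkpoint = torch.load("catmae_vit-s-16.pth", map_location="cpu")  # hypothetical filename
msg = model.load_state_dict(checkpoint["model"], strict=False)
print(msg)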

Video segmentation in DAVIS-2017

The video segmentation instructions are in DAVIS.md.

Action recognition in Kinetics-400

The action recognition instructions are in KINETICS400.md.