
[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition


AIM: Adapting Image Models for Efficient Video Action Recognition

This repo is the official implementation of "AIM: Adapting Image Models for Efficient Video Action Recognition" at ICLR 2023.

If you find our work useful in your research, please cite:

@inproceedings{
    yang2023aim,
    title={{AIM}: Adapting Image Models for Efficient Video Action Recognition},
    author={Taojiannan Yang and Yi Zhu and Yusheng Xie and Aston Zhang and Chen Chen and Mu Li},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=CIoSZ_HKHS7}
}

Introduction

In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. The overall structure of the proposed method is shown in the figure below.

During training, only the Adapters are updated, which greatly reduces the training cost while still achieving performance competitive with SoTA fully fine-tuned video models. As shown in the figure below, AIM outperforms previous SoTA methods while using fewer tunable parameters and inference GFLOPs.
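To make the idea of a lightweight Adapter concrete, here is a minimal pure-Python sketch of a generic bottleneck adapter (down-projection, activation, up-projection, residual connection). It is not necessarily the module used in this repo: the class name `Adapter`, the GELU activation, the zero-initialised up-projection and the toy sizes are all assumptions made for illustration.

```python
import math
import random


def gelu(v):
    """Exact GELU activation for a scalar."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute weight @ x + bias, where weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class Adapter:
    """Hypothetical bottleneck adapter: x + up(gelu(down(x)))."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck (small random init).
        self.down_w = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.down_b = [0.0] * bottleneck
        # Up-projection: bottleneck -> dim. Zero init is an assumption here;
        # it makes the adapter an identity map before any training.
        self.up_w = [[0.0] * bottleneck for _ in range(dim)]
        self.up_b = [0.0] * dim

    def __call__(self, x):
        hidden = [gelu(h) for h in linear(x, self.down_w, self.down_b)]
        delta = linear(hidden, self.up_w, self.up_b)
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        dim, r = len(self.up_w), len(self.down_w)
        return 2 * dim * r + dim + r  # two weight matrices plus two biases


adapter = Adapter(dim=8, bottleneck=2)  # toy sizes
x = [1.0] * 8
print(adapter(x) == x)        # identity at initialisation
print(adapter.num_params())   # tunable parameters in this toy adapter
```

In a setup like the one described above, the frozen backbone's weights would receive no gradient updates and only parameters such as `down_w`, `down_b`, `up_w` and `up_b` would be trained.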

Installation

The code is based on VideoSwin, which in turn is based on MMAction2. To prepare the environment, follow the instructions below.

# create virtual environment
conda create -n AIM python=3.7.13
conda activate AIM

# install pytorch
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge

# install other requirements
pip install -r requirements.txt

# install mmaction2
python setup.py develop

Install Apex:

We use apex for mixed precision training by default. To install apex, please follow the instructions in the apex repo.

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Data Preparation

The code is based on MMAction2, so you can refer to the MMAction2 documentation for general guidelines on preparing the data. All the datasets used in this work (K400, K700, SSv2 and Diving-48) are supported in MMAction2.

Training

The training configs for the different experiments are provided in configs/recognition/vit/. To run an experiment, use the following command, where PATH/TO/CONFIG is the training config you want to use. The default training setting is 8 GPUs with a batch size of 64.

bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU> --test-last --validate --cfg-options work_dir=<PATH/TO/OUTPUT>

We also provide a training script in run_exp.sh. You can simply change the training config to train different models.
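If you prefer to launch runs from Python, the command above can be assembled as an argument list. This is only a sketch: the helper name `build_train_command` is hypothetical, the paths are placeholders, and it uses nothing beyond the flags shown in the command above.

```python
import shlex


def build_train_command(config, num_gpu, work_dir):
    """Return the argument list for tools/dist_train.sh as used above."""
    return [
        "bash", "tools/dist_train.sh", config, str(num_gpu),
        "--test-last", "--validate",
        "--cfg-options", f"work_dir={work_dir}",
    ]


# Placeholder values; substitute a real config and output directory.
cmd = build_train_command("PATH/TO/CONFIG", 8, "PATH/TO/OUTPUT")
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually launch training from the repo root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when the config path or output directory contains spaces.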

Key Files

Evaluation

Evaluation runs automatically after training. If you would like to only evaluate a model, use the following command:

bash tools/dist_test.sh <PATH/TO/CONFIG> <CHECKPOINT_FILE> <NUM_GPU> --eval top_k_accuracy
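For readers unfamiliar with the metric, the sketch below shows what top-k accuracy measures, which is what we assume the acc@1 and acc@5 columns in the Models section report. It is a plain-Python illustration with made-up scores, not the MMAction2 implementation.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy per-class scores for three clips and three classes (illustrative only).
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
]
labels = [1, 1, 0]

print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 clips correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 clips correct
```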

Models

Kinetics 400

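| Backbone | Pretrain | GFLOPs | Param (M) | Tunable Param (M) | acc@1 | acc@5 | Views | Checkpoint |
|----------|----------|--------|-----------|-------------------|-------|-------|--------|------------|
| ViT-B/16 | CLIP | 606 | 97 | 11 | 83.9 | 96.3 | 8x3x1 | checkpoint |
| ViT-B/16 | CLIP | 1214 | 97 | 11 | 84.5 | 96.6 | 16x3x1 | checkpoint |
| ViT-B/16 | CLIP | 2428 | 97 | 11 | 84.7 | 96.7 | 32x3x1 | checkpoint |
| ViT-L/14 | CLIP | 2902 | 341 | 38 | 86.8 | 97.2 | 8x3x1 | checkpoint |
| ViT-L/14 | CLIP | 5604 | 341 | 38 | 87.3 | 97.6 | 16x3x1 | checkpoint |
| ViT-L/14 | CLIP | 11208 | 341 | 38 | 87.5 | 97.7 | 32x3x1 | checkpoint |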

Kinetics 700

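| Backbone | Pretrain | GFLOPs | Param (M) | Tunable Param (M) | acc@1 | Views | Checkpoint |
|----------|----------|--------|-----------|-------------------|-------|--------|------------|
| ViT-B/16 | CLIP | 7284 | 97 | 11 | 76.9 | 32x3x3 | checkpoint |
| ViT-L/14 | CLIP | 33624 | 341 | 38 | 80.4 | 32x3x3 | |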

Diving-48

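| Backbone | Pretrain | GFLOPs | Param (M) | Tunable Param (M) | acc@1 | Views | Checkpoint |
|----------|----------|--------|-----------|-------------------|-------|--------|------------|
| ViT-B/16 | CLIP | 809 | 97 | 11 | 88.9 | 32x1x1 | checkpoint |
| ViT-L/14 | CLIP | 3736 | 341 | 38 | 90.6 | 32x1x1 | |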
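The GFLOPs and Views columns can be related with a little arithmetic. The sketch below assumes that a Views entry reads as frames x views x views, that the reported GFLOPs are the total over all views of a video, and that parameter counts are in millions. These readings are assumptions, although the 32-frame rows copied from the tables above are consistent with them. The helper name `views_per_video` is hypothetical.

```python
def views_per_video(views):
    """Parse a Views entry such as '32x3x1' into (frames, number of views)."""
    frames, a, b = (int(part) for part in views.split("x"))
    return frames, a * b


# (dataset, backbone, total GFLOPs, views), copied from the 32-frame rows above.
rows = [
    ("K400", "ViT-B/16", 2428, "32x3x1"),
    ("K700", "ViT-B/16", 7284, "32x3x3"),
    ("Diving-48", "ViT-B/16", 809, "32x1x1"),
    ("K400", "ViT-L/14", 11208, "32x3x1"),
    ("K700", "ViT-L/14", 33624, "32x3x3"),
    ("Diving-48", "ViT-L/14", 3736, "32x1x1"),
]

per_view = {}
for dataset, backbone, gflops, views in rows:
    frames, n_views = views_per_video(views)
    per_view[(dataset, backbone)] = gflops / n_views
    print(f"{dataset:10s} {backbone} {frames} frames: "
          f"{gflops / n_views:.0f} GFLOPs per view")

# Share of parameters that are tunable, from the Param columns above.
for backbone, total_m, tunable_m in [("ViT-B/16", 97, 11), ("ViT-L/14", 341, 38)]:
    print(f"{backbone}: {tunable_m / total_m:.1%} of parameters are tunable")
```

Under these assumptions, the per-view cost of a 32-frame model is the same across the three datasets, and the differences in total GFLOPs come only from the number of views evaluated per video.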

TODO

  • Pretrained model weights

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.