This repository is the official implementation of "Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning", presented at ACM Multimedia 2023. The codebase supports video representation learning through our Fine-grained Key-Value Memory Enhanced Predictor (FGKVMemPred), which enhances predictive capability for video understanding tasks. Our implementation builds on the SlowFast codebase, extending it with the FGKVMemPred module to achieve superior performance in video representation learning.
Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning (ACM MM 2023) [PDF]
Xiaojie Li^1,2, Jianlong Wu*^1 (Corresponding Author), Shaowei He^1, Shuo Kang^3, Yue Yu^2, Liqiang Nie^1, Min Zhang^1
^1Harbin Institute of Technology, Shenzhen, ^2Peng Cheng Laboratory, ^3Sensetime Research
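The central idea is to replace the plain MLP predictor used in BYOL-style self-supervised pretraining (see the pretraining configs below) with a predictor that reads from a learnable key-value memory. As a rough illustration of that general idea only, and not the exact module from the paper, a minimal PyTorch sketch might look as follows; `dim`, `memory_size`, and the softmax `temperature` are placeholder hyper-parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeyValueMemoryPredictor(nn.Module):
    """Illustrative sketch of a key-value memory enhanced predictor.

    This is NOT the module shipped in this repository; it only sketches the
    general idea: project a query feature, attend over a learnable key memory,
    and read out the prediction from the paired value memory.
    """

    def __init__(self, dim: int = 256, memory_size: int = 4096, temperature: float = 0.05):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)                       # input projection for the query
        self.keys = nn.Parameter(torch.randn(memory_size, dim))     # learnable key memory
        self.values = nn.Parameter(torch.randn(memory_size, dim))   # learnable value memory
        self.temperature = temperature                              # softmax temperature for addressing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) features from the online encoder/projector
        q = F.normalize(self.query_proj(x), dim=-1)
        k = F.normalize(self.keys, dim=-1)
        # Fine-grained addressing: similarity of each query to every memory key
        attn = F.softmax(q @ k.t() / self.temperature, dim=-1)      # (batch, memory_size)
        # Read out the prediction as an attention-weighted sum of value slots
        return attn @ self.values                                   # (batch, dim)


if __name__ == "__main__":
    predictor = KeyValueMemoryPredictor(dim=256, memory_size=4096)
    feats = torch.randn(8, 256)
    print(predictor(feats).shape)  # torch.Size([8, 256])
```

In the provided configs, choices such as the memory size and temperature are reflected in the file names (e.g., `mem4096` and `t0.05` in the key-value memory pretraining config listed below).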
To get started with our project, please follow these setup instructions:
- **Environment Setup with Conda**: Create a Conda environment specifically for this project to manage dependencies efficiently.

  ```shell
  conda create -n pytorch_env python=3.8 pytorch=1.13.1 torchvision=0.14.1 torchaudio=0.13.1 cudatoolkit=11.7 -c pytorch -c nvidia
  ```
- **Install Required Python Packages**: Install all necessary Python packages listed in `requirements.txt` using pip.

  ```shell
  pip install -r requirements.txt
  ```
- **Set Up Detectron2 with Modifications**: Clone the Detectron2 repository and install it. Then, replace certain files in the `pytorchvideo` package with our modified versions to enhance functionality.

  ```shell
  git clone https://github.com/facebookresearch/detectron2.git
  pip install -e detectron2

  # Replace files in the pytorchvideo package with our modified versions
  cp tools/modified_files/distributed.py $(python -c 'import pytorchvideo; print(pytorchvideo.__path__[0])')/layers/
  cp tools/modified_files/batch_norm.py $(python -c 'import pytorchvideo; print(pytorchvideo.__path__[0])')/layers/
  ```
- **Clone the FGKVMemPred Repository**: Get our project repository containing all the necessary code and scripts.

  ```shell
  git clone https://github.com/xiaojieli0903/FGKVMemPred_video.git
  ```
- **Add the Project to PYTHONPATH**: Ensure Python can find the project modules by adding it to your `PYTHONPATH`.

  ```shell
  export PYTHONPATH=$(pwd)/FGKVMemPred_video/slowfast:$PYTHONPATH
  ```
- **Build FGKVMemPred_video**: Compile and install the project to make sure everything is set up correctly.

  ```shell
  cd FGKVMemPred_video
  python setup.py build develop
  ```
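As an optional sanity check (not part of the original setup steps), you can verify that PyTorch sees your GPUs and that the project package resolves. The `slowfast` import below assumes the package is exposed via the `PYTHONPATH` export above and may need adjusting for your checkout layout:

```python
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    # Assumes the project package resolves via the PYTHONPATH export above.
    import slowfast
    print("slowfast resolved from:", slowfast.__path__[0])
except ImportError as err:
    print("slowfast not importable yet:", err)
```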
This guide provides comprehensive steps for preparing the UCF101, HMDB51, and Kinetics400 datasets for use in the FGKVMemPred Video Understanding project. Follow these instructions to ensure your datasets are correctly formatted and ready for model training and evaluation.
✨ UCF101

- **Download Videos**: Acquire the UCF101 dataset from the official source.

- **Structure the Dataset**: Organize the dataset to follow this directory structure:

  ```
  {your_path}/UCF101/videos/{action_class}/{video_name}.avi
  {your_path}/UCF101/ucfTrainTestlist/trainlist{01/02/03}.txt
  {your_path}/UCF101/ucfTrainTestlist/testlist{01/02/03}.txt
  ```

- **Symbolic Links for Dataset Splits**: Create symbolic links to the dataset split lists for streamlined script processing:

  ```shell
  ln -s {your_path}/UCF101/ucfTrainTestlist/trainlist01.txt {your_path}/UCF101/train.csv
  ln -s {your_path}/UCF101/ucfTrainTestlist/testlist01.txt {your_path}/UCF101/test.csv
  ln -s {your_path}/UCF101/ucfTrainTestlist/testlist01.txt {your_path}/UCF101/val.csv
  ```
✨ HMDB51

- **Download Videos**: Obtain the HMDB51 dataset from its official source.

- **Structure the Dataset**: Ensure the HMDB51 dataset is organized as follows:

  ```
  {your_path}/HMDB51/videos/{action_class}/{video_name}.avi
  {your_path}/HMDB51/split/testTrainMulti_7030_splits/{action_class}_test_split{1/2/3}.txt
  ```

- **Generate CSV Files and Resize Videos**: Use the provided scripts to generate CSV files for training, testing, and validation, and to resize the videos:

  ```shell
  python tools/dataset_tools/process_hmdb51.py {your_path}/HMDB51/split/testTrainMulti_7030_splits/ {your_path}/HMDB51/ <split_index>
  python tools/dataset_tools/resize_videos.py {your_path}/HMDB51/ videos {your_path}/HMDB51/train.csv
  python tools/dataset_tools/resize_videos.py {your_path}/HMDB51/ videos {your_path}/HMDB51/val.csv
  ```
✨ Kinetics400

- **Download Videos**: Download the Kinetics400 dataset using the scripts provided by ActivityNet.

- **Structure and Resize the Dataset**: Organize the Kinetics400 dataset to conform to the required structure:

  ```
  {your_path}/Kinetics400/videos/{split}/{action_class}/{video_name}.mp4
  {your_path}/Kinetics/kinetics_{split}/kinetics_{split}.csv
  ```

  Use the script to resize videos to a short-edge size of 256 pixels:

  ```shell
  python tools/dataset_tools/resize_videos.py {your_path}/Kinetics-400/ {split} {your_path}/Kinetics/kinetics_{split}/kinetics_{split}.csv
  ```
✨ Notes
- Ensure the `{your_path}` placeholder is replaced with the actual path to your datasets.
- The CSV files should list video paths and their corresponding labels, formatted as `video_path label`.
- The resizing step is crucial for standardizing input sizes across datasets, facilitating more efficient training and evaluation.
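For instance, a `train.csv` might contain entries like the following (hypothetical paths and labels; the actual values depend on your directory layout, the split lists, and the `$PATH_PREFIX` you pass to the training scripts):

```
videos/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 0
videos/Archery/v_Archery_g01_c01.avi 2
```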
Once you've set up your environment and prepared your datasets, you're ready to dive into model training and evaluation. Before you begin, make sure to activate the `pytorch_env` Conda environment:

```shell
conda activate pytorch_env
```
🎈 Pretraining

Our project uses the `dist_pretrain.sh` script to launch self-supervised pretraining runs. The script requires you to specify several parameters:

- `$CONFIG`: The path to your model configuration file.
- `$PORT`: An available port number for distributed training.
- `$GPUS`: The number of GPUs you wish to use for training.
- `$LIST_PATH`: The directory where your `train.csv` and `val.csv` files are located.
- `$PATH_PREFIX`: The prefix prepended to each video path listed in your CSV files.
To launch a training session, use the following syntax:

```shell
sh scripts/dist_pretrain.sh $CONFIG $PORT $GPUS $LIST_PATH $PATH_PREFIX
```
For specific training configurations, refer to these examples (remember to adjust paths and parameters as necessary for your environment):

- **Pre-Training with an MLP Predictor**:

  ```shell
  sh scripts/dist_pretrain.sh configs/ucf101/r3d18/BYOL_SlowR18_16x4_112_400e_bs64_lr1.2_r3d18.yaml 12345 4 /path/to/ucf101/csv /path/to/ucf101/videos
  ```

- **Pre-Training with an Enhanced Key-Value Memory Predictor**:

  ```shell
  sh scripts/dist_pretrain.sh configs/ucf101/r3d18/BYOL_SlowR18_16x4_112_400e_bs64_lr1.2_r3d18_h1_mem4096_inproj_codealign_dino_dot_t0.05_synccenter.yaml 12345 4 /path/to/ucf101/csv /path/to/ucf101/videos
  ```
🎈 Evaluation
To evaluate the performance of our self-supervised learning methods, we use action recognition as a downstream task. This involves initializing models with pre-trained parameters and either fine-tuning the entire network or conducting a linear probe.
To train the action classifier utilizing the pretrained weights (`$CHECKPOINT`), execute one of the following commands based on your dataset and evaluation method:
- **Fine-tuning**:

  ```shell
  sh scripts/run_finetune_ucf101.sh $CONFIG $CHECKPOINT $LIST_PATH $PATH_PREFIX $PORT
  sh scripts/run_finetune_HMDB51.sh $CONFIG $CHECKPOINT $LIST_PATH $PATH_PREFIX $PORT
  ```
- **Linear Probe**:

  ```shell
  sh scripts/run_linear_ucf101.sh $CONFIG $CHECKPOINT $LIST_PATH $PATH_PREFIX $PORT
  sh scripts/run_linear_HMDB51.sh $CONFIG $CHECKPOINT $LIST_PATH $PATH_PREFIX $PORT
  ```
These steps will guide you through both training and evaluating models with our video understanding framework. Adjust paths and parameters according to your specific setup to ensure successful execution.
- **Perform Test Only**:

  We provide `TRAIN.ENABLE` and `TEST.ENABLE` options to control whether training or testing is run for the current job. If only testing is preferred, set `TRAIN.ENABLE` to False, and do not forget to pass the path of the model you want to test via `TEST.CHECKPOINT_FILE_PATH`.

  ```shell
  python tools/run_net.py --cfg $CONFIG DATA.PATH_TO_DATA_DIR $LIST_PATH TEST.CHECKPOINT_FILE_PATH $CHECKPOINT TRAIN.ENABLE False
  ```
| Model | Params (M) | Pretraining Dataset | Pretrain | Finetune on UCF101 | Finetune on HMDB51 | Linear Probe on UCF101 | Linear Probe on HMDB51 |
|---|---|---|---|---|---|---|---|
| R3D-18 | 31.8 | Kinetics-400 | config / model / log | 88.3 / config | 57.4 / config | 79.5 / config | 46.1 / config |
| R2+1D-18 | 14.4 | Kinetics-400 | config / model / log | 89.0 / config | 61.1 / config | 78.2 / config | 47.6 / config |
| Slow-R18 | 20.2 | Kinetics-400 | config / model / log | 87.5 / config | 57.1 / config | 79.6 / config | 47.9 / config |
| R3D-18 | 31.8 | UCF101 | config / model / log | 84.1 / config | 54.2 / config | 68.9 / config | 36.5 / config |
| R2+1D-18 | 14.4 | UCF101 | config / model / log | 84.3 / config | 53.0 / config | 66.2 / config | 36.3 / config |
If you find this project useful for your research, please consider citing our paper:
@inproceedings{li2023fine,
title={Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning},
author={Li, Xiaojie and Wu, Jianlong and He, Shaowei and Kang, Shuo and Yu, Yue and Nie, Liqiang and Zhang, Min},
booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
pages={2264--2274},
year={2023}
}
This project is made available under the Apache 2.0 license.
Special thanks to the creators of SlowFast for their pioneering work in video understanding. Our project builds upon their foundation, and we appreciate their contributions to the open-source community.