This is a PyTorch implementation of the Video Transformer Network (VTN) approach for action recognition. The repository also contains training code for other action recognition models, such as 3D CNNs, LSTMs, I3D, R(2+1)D, and two-stream networks.
The code is tested on Python 3.5; dependencies are listed in the requirements.txt
file. You can install the required packages with:
pip install -r requirements.txt
You may also need to install FFmpeg in order to prepare training data:
sudo apt-get install ffmpeg
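If you want to verify the environment before preparing data, a quick check like the sketch below (not part of the repository) confirms the Python version and that FFmpeg is on the PATH:

```python
# Illustrative sketch (not part of the repository): check that the environment
# matches the requirements above before preparing data.
import shutil
import sys

# The code is tested on Python 3.5.
assert sys.version_info >= (3, 5), "Python 3.5 or newer is expected"

# FFmpeg is needed to prepare training data.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH; install it, e.g. with apt-get")
else:
    print("ffmpeg found at", ffmpeg_path)
```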
You need to download and pre-process an action recognition dataset first:
You can download the Kinetics dataset and split the videos into 10-second clips using these instructions.
Convert the annotation files to JSON using the provided Python script:
python3 utils/kinetics_json.py ${data}/kinetics/kinetics-400_train.csv ${data}/kinetics/kinetics-400_val.csv ${data}/kinetics/kinetics-400_test.csv ${data}/kinetics/kinetics_400.json
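To get a quick look at the generated annotation file, you can load it and count the clips per subset. The snippet below is only a sketch: it assumes the JSON has top-level labels and database keys, with each database entry carrying a subset field; adjust the key names if utils/kinetics_json.py produces a different structure.

```python
# Sketch: count classes and clips per subset in the generated annotation file.
# Assumes a {"labels": [...], "database": {clip_id: {"subset": ...}}} layout;
# adjust the keys if your annotation file is structured differently.
import json
from collections import Counter

with open("kinetics_400.json") as f:
    annotation = json.load(f)

print("classes:", len(annotation.get("labels", [])))
subsets = Counter(entry.get("subset", "unknown")
                  for entry in annotation.get("database", {}).values())
print("clips per subset:", dict(subsets))
```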
Download the video list for a subset of Kinetics here. You can follow the same instructions as for the complete Kinetics dataset for downloading and pre-processing.
Download UCF-101 and the train/test splits here.
Convert all splits to JSON:
python3 utils/ucf101_json.py ${data}/ucf-101/ucfTrainTestlist
HMDB-51 videos and train/test splits can be found here.
Convert all splits to JSON:
python3 utils/hmdb51_json.py ${data}/hmdb-51/splits/
You may want to pre-process video files in order to speed up data loading and/or save some disk space.
Videos can be converted either to video (.mp4) or frames (.jpg) format (controlled by the --video-format
option).
The frames format takes more disk space but significantly improves data-loading performance,
while the video format saves disk space at the cost of slower decoding.
You may also want to re-scale your videos (e.g. to 128 or 256 pixels), which also saves disk space and improves data-loading performance.
Convert your videos using the provided script. For example:
python3 utils/preprocess_videos.py --annotation_file ${data}/kinetics/kinetics_400.json \
--raw_dir ${data}/kinetics/data \
--destination_dir ${data}/kinetics/frames_data \
--video-size 256 \
--video-format frames \
--threads 6
You need to create a configuration file for your dataset (or update an existing one) in the ./datasets
directory
to adjust paths and other parameters.
The default structure of data directories is the following:
.../
  data/                     (root dir)
    kinetics/
      frames_data/          (video path)
        .../                (directories of class names)
          .../              (directories of video names)
            ...             (jpg files)
      kinetics_400.json     (annotation path)
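As a sanity check before training, you can verify that the directories on disk follow this layout. The sketch below relies only on the structure shown above (frames_data/<class>/<video>/*.jpg); the root path is a placeholder for your setup.

```python
# Sketch: check the root/kinetics/frames_data/<class>/<video>/*.jpg layout.
# The root path below is a placeholder; point it at your own data directory.
from pathlib import Path

video_path = Path.home() / "data" / "kinetics" / "frames_data"

class_dirs = [p for p in video_path.iterdir() if p.is_dir()]
empty_videos = [
    video_dir
    for class_dir in class_dirs
    for video_dir in class_dir.iterdir()
    if video_dir.is_dir() and not any(video_dir.glob("*.jpg"))
]

print("classes found:", len(class_dirs))
print("videos without extracted frames:", len(empty_videos))
```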
After you have prepared the data, you can train or validate your model. Use the commands below as examples.
For the complete list of options, run python3 main.py --help.
Here is a summary of some important options:
- --result-path -- Directory where logs and checkpoints will be stored. If you provide a path to a directory from previous runs, training will be resumed from the latest checkpoint unless --no-resume-train is provided.
- --model -- Name of the model in ENCODER_DECODER format; the string before the first underscore is recognized as the encoder name (e.g. resnet34_vtn). You can find all implemented models in ./action_recognition/models/.
- --clip-size -- Number of frames in an input clip. Note that you should multiply it by --st to get the effective temporal receptive field (e.g. --clip-size 16 with --st 2 spans 32 consecutive frames).
- --st -- Number of skipped frames when sampling an input clip, e.g. if st=2 then every 2nd frame will be skipped.
- --resume-path -- Path to a checkpoint with a pre-trained model, either for validation or fine-tuning.
- --no-cuda -- Use this option in an environment without CUDA.
To test a pre-trained model on Kinetics:
python3 main.py --root-path ~/data --result-path ~/logs/ --dataset kinetics --model resnet34_vtn \
--batch 64 -j 12 --clip-size 16 --st 2 --no-train --no-val --test --pretrain-path ~/resnet34_vtn.pth
To train a model from scratch on Kinetics:
python3 main.py --root-path ~/data --result-path ~/logs/experiment_name --dataset kinetics --model resnet34_vtn \
--batch 64 -j 12 --clip-size 16 --st 2 --epochs 120 --lr 1e-4
To resume training from a saved checkpoint:
python3 main.py --root-path ~/data --result-path ~/logs/experiment_name --dataset kinetics --model resnet34_vtn \
--batch 64 -j 12 --clip-size 16 --st 2 --epochs 120 --lr 1e-4 --resume-path ~/save_100.pth
The same training command with results written to a separate sub-directory (e.g. for a second run of the same experiment):
python3 main.py --root-path ~/data --result-path ~/logs/experiment_name/2 --dataset kinetics --model resnet34_vtn \
--batch 64 -j 12 --clip-size 16 --st 2 --epochs 120 --lr 1e-4
To fine-tune a model pre-trained on Kinetics on the UCF-101 dataset:
python3 main.py --root-path ~/data --result-path ~/logs/ --dataset ucf101 --model resnet34_vtn \
--batch 64 -j 12 --clip-size 16 --st 2 --lr 1e-5 --pretrain-path ~/resnet34_vtn_kinetics.pth
NOTE: Models that use LayerNormalization can be converted to ONNX only with the --no-layer-norm flag,
but this may decrease the accuracy of the converted model.
Otherwise the export script will crash with the message RuntimeError: ONNX export failed: Couldn't export operator aten::std.
PyTorch to ONNX:
python3 main.py --model resnet34_vtn --clip-size 16 --st 2 --pretrain-path ~/resnet34_vtn_kinetics.pth --onnx resnet34_vtn.onnx
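Before running Model Optimizer, you may want to sanity-check the exported ONNX file with ONNX Runtime (assuming onnxruntime is installed). The sketch below feeds a dummy clip with the same [1, 16, 3, 224, 224] shape that is passed to mo.py below:

```python
# Sketch: run the exported ONNX model once on a dummy clip.
# Assumes onnxruntime is installed; the input shape matches the
# [1, 16, 3, 224, 224] shape used for Model Optimizer below.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet34_vtn.onnx")
input_name = session.get_inputs()[0].name

dummy_clip = np.random.rand(1, 16, 3, 224, 224).astype(np.float32)
logits = session.run(None, {input_name: dummy_clip})[0]
print("output shape:", logits.shape)  # expected: (1, number_of_classes)
```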
ONNX to OpenVINO:
mo.py --input_model resnet34_vtn.onnx --input_shape '[1,16,3,224,224]'
We provide some pre-trained models for your convenience:
Model | Input | Dataset | Video top-1 accuracy | Checkpoint | Command |
---|---|---|---|---|---|
MobileNetV2-VTN | RGB | Kinetics | 62.51% | Download | python main.py --dataset kinetics --model mobilenetv2_vtn -b32 --lr 1e-4 --seq 16 --st 2 |
ResNet34-VTN | RGB | Kinetics | 68.32% | Download | python main.py --dataset kinetics --model resnet34_vtn -b32 --lr 1e-4 --seq 16 --st 2 |
ResNet34-VTN | RGB-diff | Kinetics | 67.31% | Download | python main.py --dataset kinetics --model resnet34_vtn_rgbdiff -b32 --lr 1e-4 --seq 16 --st 2 |
SE-ResNext101_32x4d-VTN | RGB | Kinetics | 69.52% | Download | python main.py --dataset kinetics --model se-resnext101-32x4d_vtn -b32 --lr 1e-4 --seq 16 --st 2 --no-mean-norm --no-std-norm |
SE-ResNext101_32x4d-VTN | RGB-diff | Kinetics | 68.04% | Download | python main.py --dataset kinetics --model se-resnext101-32x4d_vtn_rgbdiff -b32 --lr 1e-4 --seq 16 --st 2 --no-mean-norm --no-std-norm |
ResNet34-VTN | RGB | UCF101 | 90.27% | Download | python main.py --dataset ucf101_1 --model resnet34_vtn -b16 --lr 1e-5 --seq 16 --st 2 --pretrain-path /PATH/TO/PRETRAINED/MODEL |
ResNet34-VTN | RGB-Diff | UCF101 | 93.02% | Download | python main.py --dataset ucf101_1 --model resnet34_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --st 2 --pretrain-path /PATH/TO/PRETRAINED/MODEL |
SE-ResNext101_32x4d-VTN | RGB | UCF101 | 91.8% | Download | python main.py --dataset ucf101_1 --model se-resnext101-32x4d_vtn -b16 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --pretrain-path /PATH/TO/PRETRAINED/MODEL |
SE-ResNext101_32x4d-VTN | RGB-diff | UCF101 | 93.44% | Download | python main.py --dataset ucf101_1 --model se-resnext101-32x4d_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --pretrain-path /PATH/TO/PRETRAINED/MODEL |
SE-ResNext101_32x4d-VTN | RGB | HMDB51 | 66.64% | Download | python main.py --dataset hmdb51_1 --model se-resnext101-32x4d_vtn -b16 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --pretrain-path /PATH/TO/PRETRAINED/MODEL |
SE-ResNext101_32x4d-VTN | RGB-diff | HMDB51 | 73.22% | Download | python main.py --dataset hmdb51_1 --model se-resnext101-32x4d_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --pretrain-path /PATH/TO/PRETRAINED/MODEL |
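If you want to inspect a downloaded checkpoint before passing it to --pretrain-path or --resume-path, a plain PyTorch load is usually enough. This is a generic sketch; the file name is a placeholder and the exact top-level keys depend on how the checkpoint was saved.

```python
# Sketch: inspect a downloaded checkpoint file with plain PyTorch.
# The file name is a placeholder; key names depend on how it was saved.
import torch

checkpoint = torch.load("resnet34_vtn.pth", map_location="cpu")

if isinstance(checkpoint, dict):
    print("top-level keys:", list(checkpoint.keys())[:10])
else:
    print("checkpoint object type:", type(checkpoint))
```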
You can try your converted models (or pre-trained models from OpenVINO) with the [demo application from the OpenVINO toolkit](https://docs.openvinotoolkit.org/latest/_inference_engine_ie_bridges_python_sample_action_recognition_README.html).