If this code helps with your work/research, please consider citing:
Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
@inproceedings{Girdhar_17a_ActionVLAD,
title = {{ActionVLAD}: Learning spatio-temporal aggregation for action classification},
author = {Girdhar, Rohit and Ramanan, Deva and Gupta, Abhinav and Sivic, Josef and Russell, Bryan},
booktitle = {CVPR},
year = 2017
}
- July 15, 2017: Released Charades models
- May 7, 2017: First release
If you're only looking for our final last-layer features that can be combined with your method, we provide those for the following datasets:
- HMDB51: data/hmdb51/final_logits
- Charades v1: data/charadesV1/final_logits
Note: These features follow our filename and class ordering, so re-organize them to match your own ordering before combining.
This code has been tested on a Linux (CentOS 6.5) system, though it should be compatible with any OS running python and tensorflow.
- TensorFlow (0.12.0rc0)
  - There have been breaking API changes in v1.0, so this code is not directly compatible with the latest tensorflow release. You can try to use my pre-compiled WHL file.
  - You may consider installing tensorflow into an environment. On anaconda, it can be done by:
$ conda create --name tf_v0.12.0rc0
$ source activate tf_v0.12.0rc0
$ conda install pip  # need to install pip into this env,
$                    # else it will use global pip and overwrite your
$                    # main TF installation
$ pip install h5py  # and other libs, if need to be installed
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ git checkout tags/0.12.0rc0
$ # Compile tensorflow. Refer https://www.tensorflow.org/install/install_sources
$ # If compiling on a CentOS (<7) machine, you might find the following instructions useful:
$ # http://rohitgirdhar.github.io/tools/2016/12/23/compile-tf.html
$ pip install --upgrade --ignore-installed /path/to/tensorflow_pkg.whl
- Standard python libraries
  - pip
  - scikit-learn 0.17.1
  - h5py
  - pickle, cPickle, etc.
This demo runs the RGB ActionVLAD model on a video. You will need the pretrained models, which can be downloaded using the get_models.sh script, as described later in this document.
$ cd demo
$ bash run.sh <video_path>
The videos need to be stored on disk as individual frame JPEG files, and similarly for optical flow.
The lists of train/test videos are specified by text files, similar to data/hmdb51/train_test_lists/train_split1.txt. Each line consists of:
video_path number_of_frames class_id
Sample train/test files are in data/hmdb51/train_test_lists. The frames must be named in the format image_%05d.jpg.
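As a rough illustration of the list format above, the following sketch parses one line and builds the expected frame paths. The names `VideoEntry`, `parse_list_line` and `frame_paths` are hypothetical helpers, not part of this repository, and the sample line values (120 frames, class 0) as well as the 1-based frame index are assumptions.

```python
import os
from typing import List, NamedTuple


class VideoEntry(NamedTuple):
    video_path: str  # path relative to the frames root, e.g. "brush_hair/19"
    num_frames: int
    class_id: int


def parse_list_line(line: str) -> VideoEntry:
    """Parse one 'video_path number_of_frames class_id' line."""
    # Split from the right so that video_path is kept whole.
    video_path, num_frames, class_id = line.strip().rsplit(None, 2)
    return VideoEntry(video_path, int(num_frames), int(class_id))


def frame_paths(frames_root: str, entry: VideoEntry, first_index: int = 1) -> List[str]:
    """Build the image_%05d.jpg paths for every frame of one video."""
    # first_index=1 is an assumption; change it if your frames start at 0.
    return [
        os.path.join(frames_root, entry.video_path, "image_%05d.jpg" % i)
        for i in range(first_index, first_index + entry.num_frames)
    ]


# Made-up example line, for illustration only.
entry = parse_list_line("brush_hair/19 120 0")
paths = frame_paths("data/hmdb51/frames", entry)
```

A quick check like `all(os.path.exists(p) for p in paths)` over every entry can catch missing frames before training starts.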
Flow is stored similarly, with 2(n-1) files per video for n frames (n-1 each for the x and y components), named flow_%c_%05d.jpg, where %c is x or y. This follows the data layout used in various previous works.
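To make the flow naming concrete, here is a minimal sketch that lists the file names expected for a video with n frames. `flow_filenames` is a hypothetical helper, and the 1-based starting index is an assumption that should be checked against your extracted flow.

```python
from typing import List


def flow_filenames(num_frames: int, first_index: int = 1) -> List[str]:
    """List the 2(n-1) flow file names expected for a video with n frames."""
    names = []
    # n frames give n-1 flow fields, each stored as an x and a y image.
    for i in range(first_index, first_index + num_frames - 1):
        for channel in ("x", "y"):
            names.append("flow_%c_%05d.jpg" % (channel, i))
    return names


names = flow_filenames(5)
```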
NOTE: For HMDB51, I renamed the videos to avoid issues with special characters in the filenames, which is why the train/test files contain numbers instead of the original names.
The list of actual filenames is provided in data/hmdb51/train_test_lists/AllVideos.txt, and the new name for each video in that list is its 1-indexed line number.
AllVideos_renamed.txt contains all the HMDB videos that are part of at least one of the train/test splits (it has fewer entries than AllVideos.txt because some videos are not in any split). So, the video brush_hair/19 in that file (and in the train/test split files) corresponds to line number 19 in AllVideos.txt.
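The renaming scheme can be inverted with a small lookup. This is only a sketch: `original_name` and `load_all_videos` are hypothetical helpers, and it assumes AllVideos.txt holds one original filename per line.

```python
from typing import List


def load_all_videos(path: str = "data/hmdb51/train_test_lists/AllVideos.txt") -> List[str]:
    """Read the original filenames, one per line (assumed layout)."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]


def original_name(renamed: str, all_videos: List[str]) -> str:
    """Map a renamed video such as 'brush_hair/19' to its original filename."""
    # The part after the last '/' is the 1-indexed line number in AllVideos.txt.
    line_number = int(renamed.rsplit("/", 1)[-1])
    return all_videos[line_number - 1]
```

For example, `original_name("brush_hair/19", load_all_videos())` would return the 19th line of AllVideos.txt.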
Create soft links to the directories where the frames are stored as follows, so the provided scripts work out-of-the-box.
$ ln -s /path/to/hmdb51/frames data/hmdb51/frames
$ ln -s /path/to/hmdb51/flow data/hmdb51/flow
and so on. Since the code requires random access to this data while training, it is advisable to store the frames/flow on a fast disk/SSD.
For ease of reproduction, you can download our frames (.tgz, 9.3GB) and optical flow (.tgz, 4.7GB) on HMDB51.
Our UCF101 models should be compatible with the data provided with the Good Practices paper.
The Charades data can be downloaded directly from the official website.
This code assumes the 480px scaled frames are stored at data/charadesV1/frames.
Download the models using the get_models.sh script. Comment out specific lines to download a subset of models.
Test all the models using the following scripts:
$ cd experiments
$ bash ext_all_logits.sh # Stores all the features for each split
$ bash combine_streams.sh <split_id> # change split_id to get final number for each split.
The above scripts (with provided models) should reproduce the following performance. The iDT features are available from [Varol16]. You can also run these with the pre-computed features provided in the data/ folder.
Split | RGB | Flow | Combined (1:2) | iDT[Varol16] | ActionVLAD+iDT |
---|---|---|---|---|---|
1 | 51.4 | 59.0 | 66.7 | 56.7 | 70.1 |
2 | 49.2 | 59.7 | 66.5 | 57.2 | 69.0 |
3 | 48.6 | 60.6 | 66.3 | 57.8 | 70.1 |
Avg | 49.7 | 59.8 | 66.5 | 57.2 | 69.7 |
NOTE: There is a very small difference (<0.1%) between the final numbers above and those reported in the paper.
This is due to an undocumented behavior of tensorflow's tf.train.batch functionality, which is slightly non-deterministic when used with multiple threads.
This can lead to some local shuffling in the order of videos at test time, which leads to inconsistent results when late-fusing different methods.
This has now been fixed by forcing the use of a single thread when saving features to disk.
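The late fusion mentioned above can be sketched as a weighted sum of per-video scores. This is a minimal illustration, not the logic of combine_streams.sh: it assumes the "1:2" in the table means RGB:Flow weights, that scores are fused directly without further normalization, and that each stream comes with a list of video names. `late_fuse` and `accuracy` are hypothetical helpers. Aligning rows by video name rather than by position avoids the ordering problem described in the note.

```python
import numpy as np


def late_fuse(rgb_names, rgb_scores, flow_names, flow_scores, w_rgb=1.0, w_flow=2.0):
    """Weighted sum of per-video scores, aligned by video name, not row order."""
    flow_row = {name: i for i, name in enumerate(flow_names)}
    # Reorder the flow rows so that row k refers to the same video as rgb row k.
    order = [flow_row[name] for name in rgb_names]
    return w_rgb * np.asarray(rgb_scores) + w_flow * np.asarray(flow_scores)[order]


def accuracy(scores, labels):
    """Fraction of videos whose top-scoring class matches the label."""
    return float(np.mean(np.argmax(scores, axis=1) == np.asarray(labels)))
```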
Charades models were trained using a slightly different version of TF, so they need a bit more work to test. Download the model data file as mentioned in the get_data.sh script (it is downloaded by default).
Then,
$ cp models/PreTrained/ActionVLAD-pretrained/charadesV1/checkpoint.example models/PreTrained/ActionVLAD-pretrained/charadesV1/checkpoint
$ vim models/PreTrained/ActionVLAD-pretrained/charadesV1/checkpoint
$ # modify the file and replace the $BASE_DIR with the **absolute path** of where the ActionVLAD repository is cloned to
$ # Now, for testing
$ cd experiments && bash 006_InceptionV2TSN_RGB_Charades_eval.sh
$ cd .. && bash eval/charades_eval.sh data/charadesV1/feats.h5
The above should reproduce the following numbers:
Model | mAP | wAP |
---|---|---|
ActionVLAD (RGB) | 17.66 | 25.17 |
Note that in the following training steps, the RGB model is trained directly on top of the ImageNet initialization, while the flow models are trained on top of the flow stream of a two-stream model. This is because we found that training only the last few layers of the RGB stream (of a two-stream model) gives good enough performance, so everything up to and including conv5_3 keeps its ImageNet initialization. Since we build our model on top of conv5_3, we end up essentially training on top of the ImageNet initialization.
$ ### Initialization for ActionVLAD (KMeans)
$ cd experiments
$ bash 001_VGG_RGB_HMDB_netvlad_feats_for_clustering.sh # extract random subset of features
$ bash 001_VGG_RGB_HMDB_netvlad_cluster.sh # cluster the features to initialize ActionVLAD
$ ### Training the model
$ bash 001_VGG_RGB_HMDB_netvlad_stage1.sh # trains the last layer with fixed ActionVLAD
$ bash 001_VGG_RGB_HMDB_netvlad_stage2.sh # trains the last layer+actionVLAD+conv5
$ bash 001_VGG_RGB_HMDB_netvlad_eval.sh # evaluates the final trained model
$ ### Initialization for ActionVLAD (KMeans)
$ cd experiments
$ bash 001_VGG_Flow_HMDB_netvlad_feats_for_clustering.sh # extract random subset of features
$ bash 001_VGG_Flow_HMDB_netvlad_cluster.sh # cluster the features to initialize ActionVLAD
$ ### Training the model
$ bash 001_VGG_Flow_HMDB_netvlad_stage1.sh # trains the last layer with fixed ActionVLAD
$ bash 001_VGG_Flow_HMDB_netvlad_stage2.sh # trains the last layer+actionVLAD+conv5
$ bash 001_VGG_Flow_HMDB_netvlad_eval.sh # evaluates the final trained model
The following scripts run testing on the flow stream of our two-stream models. As mentioned earlier, we didn't need an RGB stream model for ActionVLAD training since we could train directly on top of the ImageNet initialization.
$ cd experiments
$ bash 005_VGG_Flow_HMDB_TestTwoStream.sh
You can also train two-stream models using this code base. Here's a sample script to train an RGB stream (not tested, so it might require playing around with hyperparameters):
$ cd experiments
$ bash 005_VGG_RGB_HMDB_TrainTwoStream.sh
$ bash 005_VGG_RGB_HMDB_TestTwoStream.sh
[Varol16]: Gul Varol, Ivan Laptev and Cordelia Schmid. Long-term Convolutions for Action Recognition. arXiv 2016.
This code is based on the tensorflow/models repository, so thanks to the original authors/maintainers for releasing the code.