Recognizing Violent Human Actions in Video

Project Overview


The dataset is a subset of the Kinetics dataset (include reference). Kinetics comprises 400 general human action classes, each with about 400 video samples on average, taken from YouTube videos. From it, we created ViolentHumanActions_v2, which comprises 20 classes: 9 are truly violent actions, and the remaining 11 are non-violent actions that resemble violent ones. The non-violent classes were selected from among the most frequent wrong predictions made when evaluating the model on the violent classes only.

Violent Classes     Non-Violent Classes
punching person     punching bag
side kick           singing
high kick           playing squash
slapping            stretching arm
wrestling           soccer juggling
drop kick           playing cricket
sword fighting      kissing
headbutting         shaking head
capoeira            headbanging
-                   tai chi
-                   tango dancing

The dataset has train, valid, and test splits, accounting for roughly 80%, 15%, and 5% of the whole dataset, respectively.

The dataset is located in /datasets/ViolentHumanActions_v2. Each split has a .csv file in which each line gives an action label followed by the corresponding video filename.

    ViolentHumanActions_v2/
        train.csv
        valid.csv
        test.csv
        data/
            train/, valid/, test/
                class_0/    videos for class_0
                class_1/    videos for class_1
                ...
                class_N/
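
As a rough illustration of how the split files and the directory layout fit together, the sketch below reads one split's .csv and resolves each row to a video path. It assumes comma-separated rows with no header, in the order label then filename, and that videos sit under data/<split>/<label>/. The function name load_split is hypothetical and not taken from the repository.

```python
import csv
from pathlib import Path


def load_split(root, split):
    """Return (label, video_path) pairs for one split.

    Assumptions (not confirmed by the repository): rows are comma-separated,
    there is no header row, and videos are stored as
    <root>/data/<split>/<label>/<filename>.
    """
    root = Path(root)
    samples = []
    with open(root / f"{split}.csv", newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            label, filename = row[0].strip(), row[1].strip()
            samples.append((label, root / "data" / split / label / filename))
    return samples
```

Under these assumptions, load_split("/datasets/ViolentHumanActions_v2", "train") would return the training samples.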

Model architecture

The model is based on the I3D two-stream architecture (include reference), which uses two models: one trained on clips as sequences of RGB frames, and the other trained on sequences of optical-flow frames. Both are 3D-inflated variants of the Inception-v1 model. The model is implemented in /models/i3d/.


Since decoded video frames stored as numerical arrays take far more space than compressed video files, saving the dataset in pre-processed RGB and optical-flow form is infeasible. Therefore, the dataset is made of compressed video files, and pre-processing is computed on the fly as videos are loaded and fed to the network. See for details on the pre-processing stage.
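
To give a sense of scale, the back-of-the-envelope estimate below computes the approximate size of one decoded clip. The 250-frame average is the figure quoted in this document. The 224x224 resolution and one byte per value are assumptions chosen only for illustration.

```python
def clip_bytes(num_frames, height, width, channels, bytes_per_value=1):
    """Size in bytes of one clip stored as a dense array."""
    return num_frames * height * width * channels * bytes_per_value


# 250 frames is the average clip length quoted in this README.
# 224x224 frames and one byte per value are assumptions for illustration.
rgb = clip_bytes(250, 224, 224, channels=3)
flow = clip_bytes(250, 224, 224, channels=2)  # horizontal + vertical flow

print(f"RGB clip:  {rgb / 1e6:.1f} MB")
print(f"Flow clip: {flow / 1e6:.1f} MB")
```

Under these assumptions a single decoded RGB clip is already tens of megabytes, and storing values as 32-bit floats would multiply that by four.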

Setting up the project

Cloning the repository:

$ git clone

Environment setup

  1. Install Anaconda, if not already done, by following these instructions:

  2. Create a conda environment using the environment.yaml file, to install the dependencies:
    $ conda env create -f environment.yaml

  3. Activate the new conda environment:
    $ conda activate RecognizingViolentActions

Getting the data

$ python

Running experiments

Training the models

Models are trained by stochastic gradient descent. The default hyperparameters are:

  • learning rate: 0.001
  • momentum: 0.9
  • batch size: 5
  • maximum number of frames per clip: 60 (an upper bound: clips have about 250 frames on average, but some have fewer than 60)
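
The frame cap can be pictured as choosing a window of at most 60 frames from each clip. The sketch below picks a contiguous window with a random start. How the repository actually samples frames is not stated here, so the strategy and the name select_frame_indices are assumptions.

```python
import random


def select_frame_indices(num_frames, max_frames=60, rng=random):
    """Return indices of a contiguous window of at most max_frames frames.

    Clips shorter than max_frames are returned whole. The random-start,
    contiguous-window strategy is an assumption for illustration.
    """
    if num_frames <= max_frames:
        return list(range(num_frames))
    start = rng.randrange(num_frames - max_frames + 1)
    return list(range(start, start + max_frames))
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which can help when comparing runs.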

All hyperparameters and other options are detailed

The RGB and optical-flow models are trained independently, and the --stream argument selects which one to train. Models are saved at the end of every epoch in /out/i3d/saved_models, and training log files are saved in /out/i3d/logs/.

$ python --stream rgb --num_epochs 1

Training can be resumed at a given epoch by using the --resume_epoch argument:
$ python --stream rgb --resume_epoch 1

The project already contains checkpoints for the RGB and optical-flow models, pre-trained on Kinetics-400. They are saved as i3d_rgb_epoch-0.pkl and i3d_flow_epoch-0.pkl.
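
The two checkpoint names suggest a pattern of the form i3d_<stream>_epoch-<n>.pkl. Assuming that pattern holds for every saved epoch (it is inferred from these two file names only), a helper such as the hypothetical latest_epoch below could find the most recent epoch to pass to --resume_epoch.

```python
import re
from pathlib import Path

# Pattern inferred from i3d_rgb_epoch-0.pkl and i3d_flow_epoch-0.pkl.
CHECKPOINT_RE = re.compile(r"^i3d_(rgb|flow)_epoch-(\d+)\.pkl$")


def latest_epoch(model_dir, stream):
    """Return the highest saved epoch for a stream, or None if none exist."""
    epochs = []
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match and match.group(1) == stream:
            epochs.append(int(match.group(2)))
    return max(epochs) if epochs else None
```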

Testing the models

The RGB and optical-flow models are first tested independently. To test a model, specify the saved checkpoint. Test results are written to log files in /out/i3d/logs, and the model predictions are saved in out/i3d/preds.

To test the joint two-stream predictions, the predictions of an RGB model and an optical-flow model are averaged. To do so, specify the saved predictions of both models:

$ python test-joint --rgb_preds_path out/i3d/preds/preds_rgb_epoch-0.pkl --flow_preds_path out/i3d/preds/preds_flow_epoch-0.pkl --output_file joint-test_epoch_0.csv
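
Averaging the two streams amounts to taking the mean of the per-class scores and then picking the highest-scoring class. The sketch below shows this for a single video. The score values are made up, and the layout of the saved .pkl prediction files is not described here, so plain Python lists stand in for them.

```python
def joint_prediction(rgb_scores, flow_scores, class_names):
    """Average per-class scores from both streams and return the top class."""
    if not (len(rgb_scores) == len(flow_scores) == len(class_names)):
        raise ValueError("score lists and class_names must have equal length")
    joint = [(r + f) / 2 for r, f in zip(rgb_scores, flow_scores)]
    best = max(range(len(joint)), key=joint.__getitem__)
    return class_names[best], joint


# Hypothetical scores for one video (not real model outputs).
classes = ["tango dancing", "wrestling"]
rgb = [0.55, 0.45]   # RGB stream alone would pick "tango dancing"
flow = [0.20, 0.80]  # flow stream alone would pick "wrestling"
label, scores = joint_prediction(rgb, flow, classes)
print(label)
```

When the streams disagree, the more confident one decides the joint prediction, which is one reason averaging can beat either stream alone.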


Demonstration of a video classification using averaged predictions:

RGB pred: Dancing (see out/rgb.gif)
Flow pred: Wrestling (see out/flow.gif)
Joint pred: Wrestling

The following results show the top-1 test accuracy for the RGB, optical-flow, and joint predictions. They also show the results after fine-tuning the RGB model on ViolentHumanActions_v2 for 3 epochs (using the default hyperparameters):

Testing accuracy:

Model             Pre-trained on Kinetics   Fine-tuned
RGB-I3D           0.6014                    0.7210
Flow-I3D          0.6159                    -
Two-Stream I3D    0.6630                    0.7391
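
Top-1 accuracy is the fraction of test videos whose highest-scoring class matches the true label. A minimal sketch, using a hypothetical function name and toy labels rather than the project's evaluation code:

```python
def top1_accuracy(predicted, actual):
    """Fraction of samples where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have equal length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels for illustration only.
preds = ["wrestling", "kissing", "slapping", "tai chi"]
truth = ["wrestling", "kissing", "headbutting", "tai chi"]
print(top1_accuracy(preds, truth))
```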