Video classification using VGG16 as a feature extractor, combined with an RNN. The dataset used is UCF101 (cricket bowling and batting classes).
A simple RNN is used to better classify the temporal frame sequences from videos.
Demo: Cricket batting | Cricket batting/bowling
Some use cases include monitoring for anomalies and suspicious human actions, and alerting staff/authorities.
Feature extraction:
- A pretrained VGG16 is used as a feature extractor after fine-tuning (unfreezing) its top 4 layers.
- A simple classifier is then connected to VGG16 and trained to identify whether a frame belongs to class 1 or class 2.
- The top classifier is then disconnected, and only the dense layer with a 1024-dimensional output is used to obtain the sparse representation of each frame.
- Data to LSTM format: the per-frame representations are stacked into a tensor of shape (NUM_FRAMES, LOOK_BACK, 1024); see the sketch below.
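A minimal sketch of this two-stage setup, assuming TensorFlow/Keras (the head layers and names here are illustrative, not the repo's exact code):

```python
# Sketch only: assumes TensorFlow/Keras; head layers and names are illustrative.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-4]:
    layer.trainable = False  # fine-tune only the top 4 layers

x = Flatten()(base.output)
features = Dense(1024, activation="relu", name="features")(x)  # 1024-d frame representation
out = Dense(2, activation="softmax")(features)                 # class 1 vs class 2

classifier = Model(base.input, out)       # stage 1: train on single frames
extractor = Model(base.input, features)   # stage 2: emit per-frame features for the LSTM
```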
RNN:
- A standard LSTM is used. Note that you need GPU/CUDA support if you would like to run CuDNNLSTM layers in the model.
- Finally, the LSTM network is trained to distinguish between your desired class 1 and class 2 videos; a sketch follows.
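A minimal sketch of such an LSTM classifier, assuming Keras (the layer width and the LOOK_BACK value are assumptions):

```python
# Sketch only: assumes Keras; LOOK_BACK and the layer width are illustrative.
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

LOOK_BACK = 5  # assumed number of consecutive frames per sequence

model = Sequential([
    LSTM(128, input_shape=(LOOK_BACK, 1024)),  # consumes stacked VGG16 features
    Dense(2, activation="softmax"),            # class 1 vs class 2
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```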
- Install all the required Python dependencies:
pip install -r requirements.txt
- To run inference either on a test video file or on webcam:
python run.py
- Note that inference runs on the test video file by default.
- To change it, simply set
FROM_WEBCAM = True
in the config options at mylib/Config.py.
- Trained model weights (for this example) can be downloaded from here. Make sure you extract them into the folder 'weights'.
- The class probabilities and the inference time per frame are also displayed:
[INFO] Frame acc. predictions: 0.91895014
Frame inference in 0.0030 seconds
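As a rough illustration of how the FROM_WEBCAM toggle could drive the video source (only FROM_WEBCAM itself comes from mylib/Config.py; the OpenCV capture loop and test path below are assumptions, not the repo's exact code):

```python
# Sketch only: assumes OpenCV; TEST_VIDEO_PATH is a hypothetical placeholder.
import cv2
from mylib.Config import FROM_WEBCAM

TEST_VIDEO_PATH = "tests/video.mp4"  # hypothetical test file
cap = cv2.VideoCapture(0 if FROM_WEBCAM else TEST_VIDEO_PATH)  # 0 = default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # ... per-frame feature extraction + LSTM prediction would run here ...
cap.release()
```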
- You can also choose to send prediction accuracies over email if desired. Follow the instructions in mylib/Mailer.py (to set up the sender mail).
- Enter the receiver mail in the config options at mylib/Config.py.
- In case of severe false positives, make sure to tune the Threshold and positive_frames parameters to further narrow down the predictions. Please refer to the config:
Threshold = 0.50  # minimum class probability for a frame to count as positive
if pred >= Threshold:
    if total_frames > 5:  # require several consecutive positive frames before alerting
        print('[INFO] Sending mail...')
- Some image preprocessing is required before training on your own data!
- In the 'Preprocessing.ipynb' file, the frames from each video class are extracted and sorted into their respective folders; see the sketch after this list.
- Note that the frames are resized to 224x224 (the VGG16 input size).
- The dataset can be downloaded from here.
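A minimal sketch of such frame extraction, assuming OpenCV (the function name and paths are illustrative, not taken from the notebook):

```python
# Sketch only: assumes OpenCV; names and paths are illustrative.
import os
import cv2

def extract_frames(video_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))  # VGG16 input size
        cv2.imwrite(os.path.join(out_dir, f"frame_{i:05d}.jpg"), frame)
        i += 1
    cap.release()
```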
- 'Train.ipynb', as the name implies, trains your model.
- Training is visualized with the help of TensorBoard (see the logging sketch after this list). Use the command:
tensorboard --logdir data/_training_logs/rnn
- Make sure to review the parameters in the config options at mylib/Config.py.
- The parameters you come across in Train.ipynb must be the same during training and inference.
- If you would like to change them, simply do so in the training file and also in the config options.
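A minimal sketch of how training logs could be written so that the tensorboard command above finds them, assuming Keras (the fit() call is illustrative):

```python
# Sketch only: assumes Keras; the fit() arguments are illustrative.
from tensorflow.keras.callbacks import TensorBoard

tb = TensorBoard(log_dir="data/_training_logs/rnn")  # matches the --logdir above
# model.fit(X_train, y_train, epochs=30, callbacks=[tb])
```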
Main:
- VGG16 paper: https://arxiv.org/pdf/1409.1556.pdf
- UCF101 Action Recognition Data Set: https://www.crcv.ucf.edu/data/UCF101.php
Optional:
- TensorBoard: https://www.tensorflow.org/tensorboard
- Investigate and benchmark different RNN architectures to better classify the temporal sequences; one possible variant is sketched below.
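For example, a GRU could be benchmarked in place of the LSTM (a sketch assuming Keras; the sizes mirror the assumed LSTM above):

```python
# Sketch only: a GRU variant to benchmark against the LSTM (assumes Keras).
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.models import Sequential

gru_model = Sequential([
    GRU(128, input_shape=(5, 1024)),  # same (LOOK_BACK, features) input as the LSTM
    Dense(2, activation="softmax"),
])
gru_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```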
To get started/contribute quickly (optional) ...
- Option 1: 🍴 Fork this repo and pull request!
- Option 2: 👯 Clone this repo:
$ git clone https://github.com/saimj7/Action-Recognition-in-Real-Time.git
- Roll it!
saimj7/ 06-09-2020 © Sai_Mj.