Action Classification of 4 Classes of the UCF-101 Dataset Using a Transformer

TEAM MEMBERS

Team leader: Akash Ghimire (12194814)

Other members: Sanjib Tamang (12194939), Ritik Deuja (12194824), Keshav Adhikari (12194874)

Important Links

Walkthrough video ==> We have already included code snippets in our Final Presentation PowerPoint slides, but we felt that was not enough to explain the whole workflow. So we decided to make a walkthrough video tutorial of our project. We hope this video clarifies most viewers' doubts.

Source Codes ==> This folder contains the source code files used for this project, both .py and .ipynb (Jupyter notebook) files, covering data preprocessing, feature extraction and engineering, and training and testing the model. Everything is explained in our Walkthrough Video, linked just above.

Final Presentation File ==> This is the link to our Final Presentation, which is hosted on Google Slides.

Demo Video ==> This is a demo of how our trained model predicts actions on a downloaded video.

Explanation of code (As asked by Professor)

  1. Since video action recognition is a temporal task, we need sequences of images. For this project we sampled 20 frames from each video. The code for this can be found in this jupyter file, in the function extract_frames(). A detailed description is given in the comments inside the jupyter notebook file itself.

  2. The next step is to extract RGB features from each of those frames. In this repository we used DenseNet121 to extract features from each frame. DenseNet121 outputs 1024 features per frame, so each input video yields (20 × 1024) features. The code for this can be found in this jupyter file, in the function features_extraction_model(). A detailed description is given in the comments inside the jupyter notebook file itself.

  3. Since this is a supervised learning task, we needed to create features and labels for each video used in this project. The code for this can be found in this jupyter file, in the function create_dataset(). Every line of code is explained with comments.

  4. Finally, we trained our Transformer model using the features and labels produced by the steps above. The code for this can be found in this jupyter file. The model is created by the function model(), and it was trained like any other tf.keras model. The saved model weights are stored in this folder. A minimal sketch of this training step is shown below.
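To make step 4 concrete, here is a minimal training sketch, not the notebook code itself. It assumes the output of create_dataset() is a feature array of shape (num_videos, 20, 1024) with integer class labels 0–3, and it uses a simple pooling classifier as a stand-in for the actual Transformer built by model() (that architecture is sketched in the Transformer Encoder section below); the weight path is hypothetical.

```python
# Minimal sketch (assumptions noted above), not the notebook code.
import tensorflow as tf

NUM_FRAMES, FEATURE_DIM, NUM_CLASSES = 20, 1024, 4

def build_model():
    # Stand-in architecture: average the 20 frame features and classify.
    # The real model() in the notebook builds the Transformer encoder
    # described later in this README.
    inputs = tf.keras.Input(shape=(NUM_FRAMES, FEATURE_DIM))
    x = tf.keras.layers.GlobalAveragePooling1D()(inputs)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_features: (num_videos, 20, 1024); train_labels: integer ids 0-3
# model.fit(train_features, train_labels,
#           validation_data=(valid_features, valid_labels),
#           epochs=30, batch_size=16)
# model.save_weights("saved_weights/transformer_ucf4.h5")  # hypothetical path
```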

What is human action recognition?

It is the task of identifying the action a person is performing in an image or video. A model is trained to recognize a wide range of behaviors, from drinking, falling, and riding a bike to running and sleeping. It extracts human behavior from video pixel by pixel and analyzes it with comprehensive, advanced algorithms, detecting and interpreting activity in real time using AI.

Methodology

Dataset Used: ( UCF 101 Action Recognition Dataset )

An action recognition data set of realistic action videos, collected from YouTube.

  • Number of action classes: 101
  • Trimmed video dataset
  • Action categories can be divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports
  • One of the most popular public action recognition datasets

image

Dataset downloaded from Kaggle

image

Figure shows how the UCF-101 dataset looks in the training environment.

image

Figure shows how videos are placed in each action class folder.

image

Sample video

sample.video.mp4
Action class

Feature Extraction

Spatial-temporal features from the T sampled frames are extracted using a pre-trained CNN model. For our project we used the DenseNet architecture.

image

Architecture of DenseNet

image

Dataset Preprocessing

Before going into dataset preprocessing, let us define some terminology.

  1. Spatial information: information obtained from a single image. For example, in image analysis a DNN uses the spatial information in an image to classify it into a class.

  2. Temporal information: information obtained over a period of time or across a sequence of events. For example, an RNN uses information across a text sequence to perform language translation.

Since a video is a collection of images over a period of time, we need spatial-temporal information. Each video is uniformly sampled into T frames (T = 20 in our project), as sketched below.
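The sketch below shows one way to do this uniform sampling with OpenCV; the function name sample_frames() and the 224 × 224 frame size are illustrative assumptions and may differ from the extract_frames() function in the notebook.

```python
# Minimal sketch: uniformly sample T frames from a video (assumes OpenCV).
import cv2
import numpy as np

def sample_frames(video_path, num_frames=20, size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick num_frames indices spread evenly across the whole video.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.array(frames)  # shape: (num_frames, 224, 224, 3)
```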

image

These T frames are then passed through the pre-trained DenseNet CNN to obtain spatial-temporal features, as described in the Feature Extraction section above. A minimal sketch of this step follows the pipeline figure below.

image

Video feature extraction pipeline
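As a concrete illustration of this pipeline, here is a minimal sketch of per-frame feature extraction with the pre-trained DenseNet121 from tf.keras.applications; the exact preprocessing in the notebook's features_extraction_model() may differ, but the result is the same (20 × 1024) feature matrix per video.

```python
# Minimal sketch: frames -> (20 x 1024) DenseNet121 features (assumptions noted above).
import tensorflow as tf

# DenseNet121 without its classifier head; global average pooling gives a
# 1024-dimensional vector per frame.
backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))
backbone.trainable = False

def extract_video_features(frames):
    """frames: (20, 224, 224, 3) uint8 array -> (20, 1024) float array."""
    x = tf.keras.applications.densenet.preprocess_input(frames.astype("float32"))
    return backbone.predict(x, verbose=0)  # one 1024-d vector per frame
```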

Feature Engineering

  1. Because of hardware resource limitations, we chose only 4 of the 101 action classes.

  2. Since our model is based on a supervised learning method, we labeled each action class as shown in the table below.

image

  3. The input features and labels are split into train, validation, and test sets in a 0.6 : 0.2 : 0.2 ratio, as sketched below.
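Below is a minimal sketch of the labeling and splitting step. The four class names in LABEL_MAP are only illustrative placeholders (the classes actually used are defined in the notebook), and scikit-learn's train_test_split is used here purely for convenience.

```python
# Minimal sketch: integer labels per class and a 0.6 / 0.2 / 0.2 split.
from sklearn.model_selection import train_test_split

LABEL_MAP = {"ApplyEyeMakeup": 0, "Archery": 1,
             "Basketball": 2, "Biking": 3}  # hypothetical class choice

def split_dataset(features, labels, seed=42):
    """features: (num_videos, 20, 1024); labels: (num_videos,) integer ids."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        features, labels, test_size=0.4, random_state=seed, stratify=labels)
    x_valid, x_test, y_valid, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    return (x_train, y_train), (x_valid, y_valid), (x_test, y_test)
```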

Action Recognition Model: Traditional Approach

Not long ago, LSTM/GRU networks were the standard models for this kind of task.

Drawbacks of LSTM/GRU

  1. LSTM/GRU suffers from the vanishing gradient problem, so these RNNs have short-term memory and are not suitable when the input sequence is long.

  2. LSTM/GRU cannot process the whole sequence in parallel and therefore cannot take advantage of modern GPUs, which are built to handle tasks in parallel.

  3. LSTM/GRU models do not work well with transfer learning.

  4. LSTMs take longer to train.

image

Our method: Transformer Encoder

image

Transformer Encoder: Model Architecture

  1. Positional Encoding
  • Describes the location or position of an entity in a sequence so that each position is assigned a unique representation (a sketch of the sinusoidal encoding is given below the figure).

image
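For reference, here is a minimal sketch of the standard sinusoidal positional encoding from "Attention Is All You Need"; the notebook's implementation may instead use a learned embedding over the 20 frame positions.

```python
# Minimal sketch of sinusoidal positional encoding (standard formulation).
import numpy as np

def positional_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]   # (T, 1) frame positions
    i = np.arange(d_model)[None, :]           # (1, d_model) feature indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])      # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])      # odd dimensions: cosine
    return pe                                 # added to the frame features
```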

  2. Multi-headed Attention:
  • Contains multiple self-attention layers within it.

  • Gives more attention to important frames and less attention to unimportant frames.

image

  3. Addition layer:
  • Similar to ResNet, adds the block's input to the output of the previous layer (a residual/skip connection).

image

  4. Feed-forward Network:
  • The simplest form of neural network, as information is processed in only one direction.

  • An artificial neural network in which the connections between nodes do not form a cycle.

image
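Putting the four components above together, here is a minimal sketch of one Transformer encoder block built from standard tf.keras layers, plus a small classification head for the 4 classes; the notebook's model() function may differ in hyper-parameters, normalization placement, and head design.

```python
# Minimal sketch of a Transformer encoder block over (20, 1024) frame features.
import tensorflow as tf

def transformer_encoder_block(x, num_heads=4, key_dim=64, ff_dim=256, dropout=0.1):
    # Multi-headed self-attention across the 20 frames.
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=key_dim, dropout=dropout)(x, x)
    # Addition layer (residual connection) followed by layer normalization.
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # Position-wise feed-forward network.
    ff = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    ff = tf.keras.layers.Dense(x.shape[-1])(ff)
    # Second addition + normalization.
    return tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ff)

# Example head: classify a (20, 1024) feature sequence into 4 action classes.
inputs = tf.keras.Input(shape=(20, 1024))
x = transformer_encoder_block(inputs)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)
encoder_model = tf.keras.Model(inputs, outputs)
```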

Training pipeline

image

Advantages of Transformer over LSTM/GRU

  • It processes the whole sequence at the same time, thus taking advantage of modern GPUs.

  • It does not suffer from the vanishing gradient problem and also works well with transfer learning.

  • Inspired by the success of BERT (a Transformer encoder model for natural language tasks), we adopted a BERT-style encoder for our action recognition task.

Experimental setup

image

Weekly objectives

Team meeting and discussion

image

image

image

Demo video and project implementation

YouTube Link

Contribution of team members

image
