Team leader: Akash Ghimire (12194814)
Other members: Sanjib Tamang (12194939), Ritik Deuja (12194824), Keshav Adhikari (12194874)
Walkthrough video ==> We have already included code snippets in our Final Presentation slides, but we felt they were not enough to explain the whole scenario, so we made a walkthrough video tutorial of our project. We hope this video clarifies most viewers' doubts.
Source Codes ==> This folder contains the source code files used for this project, including both .py and .ipynb (Jupyter notebook) files used for data preprocessing, feature extraction and engineering, and training and testing the model. Everything is explained in detail in our walkthrough video, whose link can be found just above.
Final Presentation File ==> This is the link to our Final Presentation file, which has been uploaded to Google Slides.
Demo Video ==> This is a demo of how our trained model predicts actions in a downloaded video.
- As video action recognition is a temporal task, we need sequences of images. For this project, we sampled 20 frames from each video. The code can be found in this Jupyter notebook in the function extract_frames(); a detailed description is commented inside the notebook itself.
- The next step is to extract RGB features from each of those frames. In this repository, we used DenseNet121, which outputs 1024 features per frame, so each input video yields a (20 x 1024) feature matrix. The code can be found in this Jupyter notebook in the function features_extraction_model(); a detailed description is commented inside the notebook itself.
- As this is a supervised learning task, we needed to create features and labels for every video used in the project. The code can be found in this Jupyter notebook in the function create_dataset(); every line of code is explained with comments.
- Finally, we trained our Transformer model using the features and labels extracted above. The code can be found in this Jupyter notebook. The model is created by the function model() and trained like any other tf.keras model; the trained model weights are saved in this folder.
Action recognition is the task of identifying the action a person performs in an image or video. A model is trained to identify a wide range of behaviors, from drinking, falling, and riding a bike to running and sleeping. It collects human behavior from video pixels and analyzes it using comprehensive, advanced algorithms, detecting, analyzing, and interpreting activity in real time by employing AI.
An action recognition dataset of realistic action videos collected from YouTube.
Number of Action Classes: 101
Trimmed Video Dataset
Action categories can be divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports.
One of the most popular public action recognition datasets.
Pipeline: sample video (sample.video.mp4) → predicted action class
Before covering dataset preprocessing, let us define some terminology.
- Spatial information: information obtained from a single image. For example, in image analysis a DNN uses the spatial information in an image to classify it into some class.
- Temporal information: information obtained over a period of time or a sequence of events. For example, an RNN uses text information over a sequence to perform language translation.
Since a video is a collection of images over a period of time, we need spatial-temporal information. Each video is uniformly sampled into T frames, and spatial-temporal features from those "T" frames are extracted using a pre-trained CNN model. For our project we used the DenseNet architecture.
Video feature extraction pipeline
- Because of hardware limitations, we chose only 4 of the 101 action classes.
- Since our model is based on supervised learning, we labeled each action class as shown in the given table.
- Input features and labels are split into train, validation, and test sets in a 0.6:0.2:0.2 ratio.
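The 0.6:0.2:0.2 split can be sketched as below; this is an illustrative helper (the notebook may use e.g. sklearn's `train_test_split` instead):

```python
import numpy as np

def train_valid_test_split(features, labels, seed=0):
    # shuffle, then cut into 60% train / 20% validation / 20% test
    n = len(features)
    order = np.random.default_rng(seed).permutation(n)
    f, l = features[order], labels[order]
    n_train, n_valid = int(0.6 * n), int(0.2 * n)
    return ((f[:n_train], l[:n_train]),
            (f[n_train:n_train + n_valid], l[n_train:n_train + n_valid]),
            (f[n_train + n_valid:], l[n_train + n_valid:]))
```

Shuffling before the cut keeps all three splits balanced when the videos are stored grouped by class.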
Not long ago, LSTM/GRU models were the standard choice for this kind of sequence task.
Drawbacks of LSTM/GRU
- LSTM/GRU suffers from the vanishing gradient problem; such RNNs thus have short-term memory and are not suitable when the input sequence is long.
- LSTM/GRU cannot process the whole sequence in parallel, so it cannot take advantage of modern GPUs, which are built to handle tasks in parallel.
- LSTM/GRU models do not work well with transfer learning.
- LSTMs take longer to train.
- Positional encoding: describes the location or position of each entity in a sequence so that every position is assigned a unique representation.
- Multi-headed attention: contains multiple self-attention layers within it; it gives more attention to important frames and less to unimportant ones.
- Addition layer: similar to ResNet, it adds the input to the output of the previous layer (a residual connection).
- Feed-forward network: the simplest form of neural network, in which information is processed in only one direction; an artificial neural network whose connections between nodes do not form a cycle.
- The Transformer processes all the sequence positions at the same time, thus taking advantage of modern GPUs.
- It doesn't suffer from vanishing gradients and also works well with transfer learning.
- Inspired by the success of BERT (a Transformer-based language model), we adopted a BERT-style model for the action recognition task.
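For the positional encoding component listed above, the classic sinusoidal formulation from the original Transformer paper assigns every position a unique representation; a sketch (the project may instead use a learned embedding):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # sinusoidal positional encoding: even columns use sin, odd columns use cos,
    # at wavelengths that grow geometrically with the column index
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model)[None, :]         # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe
```

For this project the sequence length would be 20 (the sampled frames) and `d_model` the model's hidden size; each of the 20 rows is distinct, which is what lets the attention layers tell frame orders apart.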
Team meeting and discussion