SlowFast Network Video Classification & Detection

Features

This algorithm is from Facebook AI Research (FAIR).
PySlowFast: a high-performance, lightweight, and efficient video understanding codebase in PyTorch.
SlowFast networks pre-trained on the Kinetics 400 dataset
The SlowFast algorithm recognizes what activity is being performed in a video. It can also detect actions occurring in the video.
Used for video understanding research on different tasks (classification, detection, etc.).
SlowFast: a novel method for analyzing the contents of a video segment.
It has two pathways: the Slow pathway captures the semantics of objects, and the Fast pathway captures motion.
Both pathways perform 3D convolution operations.
GitHub: https://github.com/facebookresearch/SlowFast
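
The split between the two pathways can be illustrated by which frames of a clip each pathway sees. The sketch below is a minimal pure-Python illustration, not the PySlowFast implementation: the function name pathway_frame_indices is hypothetical, and the values num_frames=32 and alpha=4 (the frame-rate ratio between the Fast and Slow pathways) are assumptions chosen for the example.

```python
def pathway_frame_indices(num_frames, alpha):
    """Return (slow_indices, fast_indices) for one clip.

    The Fast pathway keeps every frame of the clip; the Slow pathway keeps
    num_frames // alpha frames spread evenly over the clip.
    """
    if num_frames < 1 or alpha < 1:
        raise ValueError("num_frames and alpha must be positive")
    num_slow = num_frames // alpha
    if num_slow < 1:
        raise ValueError("alpha must not exceed num_frames")

    fast = list(range(num_frames))
    if num_slow == 1:
        slow = [0]
    else:
        # Evenly spaced positions from the first to the last frame, rounded down.
        slow = [(i * (num_frames - 1)) // (num_slow - 1) for i in range(num_slow)]
    return slow, fast


# Assumed example values: a 32-frame clip and a speed ratio of 4.
slow, fast = pathway_frame_indices(num_frames=32, alpha=4)
print(len(slow), len(fast))  # 8 32
print(slow)                  # [0, 4, 8, 13, 17, 22, 26, 31]
```

With these values the Slow pathway processes 8 sparsely sampled frames, while the Fast pathway processes all 32, which is what lets it follow rapid motion.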

SlowFast networks pre-trained on the Kinetics 400 dataset
Kinetics: a collection of datasets of URL links to up to 650,000 video clips covering 400, 600, or 700 human action classes, depending on the dataset version.
Videos include human-object interactions and human-human interactions such as shaking hands and hugging.
Each action class has at least 400, 600, or 700 video clips, depending on the dataset version.
Each clip is human-annotated with a single action class and lasts around 10 seconds.
Load a pre-trained video classification model in PyTorchVideo and run it on a test video.
Running SlowFast networks pre-trained on the Kinetics 400 dataset.
Link: https://www.deepmind.com/open-source/kinetics

Create a new conda environment (for a Jupyter notebook) or use a Google Colab notebook

conda create --name Slowfast python
conda activate Slowfast
git clone https://github.com/facebookresearch/SlowFast.git

Import all functions
Load the model
Set the model to eval mode and move to desired device
Download the id to label mapping for the Kinetics 400 dataset on which the torch hub models were trained.
This will be used to get the category label names from the predicted class ids.
Define input transform
Before passing the video into the model, we need to apply some input transforms and sample a clip of the correct duration.
Run Inference
Download an example video
Load the video and transform it to the input format required by the model
Get Predictions
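
Two of the steps above involve small calculations that are easy to get wrong: choosing the clip duration to sample, and turning the model's raw scores into label names. The following is a minimal sketch in plain Python that does not load a model or a video. The numbers (32 frames, sampling rate 2, 30 frames per second), the scores, and the four-entry id-to-label mapping are all assumptions made for illustration; the real mapping for Kinetics 400 has 400 entries and is downloaded separately.

```python
import math


def clip_duration(num_frames, sampling_rate, fps):
    """Seconds of video needed to sample num_frames at the given frame stride."""
    return num_frames * sampling_rate / fps


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(scores, id_to_label, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_label[i], probs[i]) for i in ranked[:k]]


# Assumed sampling settings (not stated in the steps above).
print(round(clip_duration(num_frames=32, sampling_rate=2, fps=30), 3))

# Hypothetical model output for a tiny 4-class label mapping.
id_to_label = {0: "abseiling", 1: "air drumming", 2: "archery", 3: "bowling"}
scores = [0.1, 2.0, -1.0, 3.5]
for label, prob in top_k_labels(scores, id_to_label, k=2):
    print(f"{label}: {prob:.3f}")
```

In a real run, the scores would come from the model's output for the transformed clip, and the predicted class ids would be looked up in the downloaded Kinetics 400 mapping in the same way.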

SlowFast model architectures are based on [1], with pre-trained weights using the 8x8 setting (8 frames sampled at a temporal stride of 8 for the Slow pathway) on the Kinetics dataset.
Both the Slow and Fast pathways use a 3D ResNet model, capturing several frames at once and running 3D convolution operations on them.
Reference: [1] Christoph Feichtenhofer et al., “SlowFast Networks for Video Recognition,” https://arxiv.org/pdf/1812.03982.pdf
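
A 3D convolution slides its kernel over time as well as height and width, so the output size along each of the three axes follows the usual convolution formula. The sketch below applies that formula (without dilation) in plain Python. The helper name conv3d_output_shape and the kernel, stride, and padding values are illustrative assumptions, not settings taken from this document.

```python
def conv3d_output_shape(input_thw, kernel, stride, padding):
    """Output (T, H, W) of a 3D convolution without dilation.

    Per axis: out = floor((in + 2 * padding - kernel) / stride) + 1
    """
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(input_thw, kernel, stride, padding)
    )


# Illustrative case 1: a kernel of temporal size 1 leaves the 8 frames untouched
# and halves the spatial resolution.
print(conv3d_output_shape((8, 224, 224), (1, 7, 7), (1, 2, 2), (0, 3, 3)))
# (8, 112, 112)

# Illustrative case 2: a kernel spanning 5 frames mixes information across time;
# temporal padding of 2 keeps all 32 time steps.
print(conv3d_output_shape((32, 224, 224), (5, 7, 7), (1, 2, 2), (2, 3, 3)))
# (32, 112, 112)
```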

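The 3D ResNet model architecture is also based on [1], with pre-trained weights using the 8x8 setting on the Kinetics dataset.
A residual neural network (ResNet) is an artificial neural network (ANN) that adds skip (shortcut) connections around groups of layers, so that each block learns a residual on top of its input.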
ResNet 3D is a type of model for video that employs 3D convolutions.
This model collection consists of two main variants.
The first formulation is named mixed convolution (MC) and employs 3D convolutions only in the early layers of the network, with 2D convolutions in the top layers.
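
One way to see the trade-off behind mixed convolution is to count weights: a 3D kernel has an extra temporal dimension, so it carries more parameters than a 2D kernel with the same channels. The sketch below is a rough illustration only. The helper names, the five-layer layout, and the channel and kernel sizes are assumptions, and bias terms are ignored.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def conv2d_params(c_in, c_out, kh, kw):
    """Weight count of a 2D convolution (bias ignored)."""
    return c_in * c_out * kh * kw


def mixed_conv_plan(num_layers, num_3d_layers):
    """Label the early layers '3d' and the remaining top layers '2d'."""
    if not 0 <= num_3d_layers <= num_layers:
        raise ValueError("num_3d_layers must be between 0 and num_layers")
    return ["3d"] * num_3d_layers + ["2d"] * (num_layers - num_3d_layers)


# Hypothetical layout: 5 layers, 64 channels throughout, 3x3(x3) kernels.
plan = mixed_conv_plan(num_layers=5, num_3d_layers=2)
total = sum(
    conv3d_params(64, 64, 3, 3, 3) if kind == "3d" else conv2d_params(64, 64, 3, 3)
    for kind in plan
)
print(plan)   # ['3d', '3d', '2d', '2d', '2d']
print(conv3d_params(64, 64, 3, 3, 3), conv2d_params(64, 64, 3, 3))  # 110592 36864
print(total)  # 331776
```

In this toy layout each 3D layer costs three times the weights of a 2D layer, which is why restricting 3D convolutions to the early layers reduces model size.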

X3D model architectures are based on the paper referenced below, with weights pre-trained on the Kinetics dataset.
Reference Paper: Christoph Feichtenhofer, “X3D: Expanding Architectures for Efficient Video Recognition.” https://arxiv.org/abs/2004.04730

Multi-grid Training:
Multi-grid training is a mechanism for training video architectures efficiently. Instead of using a fixed batch size for training, this method uses varying batch sizes on a defined schedule, while keeping the computational budget approximately unchanged by holding batch x time x height x width constant.
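
The constant-budget rule can be worked through numerically: when the clip shape shrinks, the batch size grows so that batch x time x height x width stays the same. The sketch below is an illustration under assumed numbers. The base shape (batch 8, 16 frames, 224 x 224) and the list of alternative shapes are hypothetical, not the schedule used by any particular codebase.

```python
def scaled_batch_size(base_shape, clip_shape):
    """Batch size that keeps batch * T * H * W at or just under the base budget.

    base_shape: (batch, T, H, W) of the reference setting
    clip_shape: (T, H, W) of the smaller or larger clip to train on
    """
    base_batch, base_t, base_h, base_w = base_shape
    t, h, w = clip_shape
    budget = base_batch * base_t * base_h * base_w
    # Floor division: the budget is matched approximately when it does not divide evenly.
    return max(1, budget // (t * h * w))


base = (8, 16, 224, 224)  # assumed reference setting
for shape in [(16, 224, 224), (8, 224, 224), (8, 112, 112), (4, 112, 112)]:
    print(shape, scaled_batch_size(base, shape))
```

Halving the number of frames doubles the batch size, and halving both height and width multiplies it by four, so the (8, 112, 112) shape above trains with a batch of 64 for the same budget.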