NextVLAD Attention Video Classification Model (Not Finished)

**An Efficient Neural Network for the Video Classification Challenge (Non-Commercial Use Only)**

Project Introduction

With the prevalence of smartphones and digital cameras in everyday life, an exponentially growing number of images and videos are created, uploaded, watched, and shared over the internet. In the past few years, Convolutional Neural Networks (CNNs) have proven to be an effective class of models for understanding image content in tasks such as recognition, segmentation, detection, and retrieval. The key enabling factors behind these results were the massive computational power of GPUs and large-scale datasets such as ImageNet. In a similar vein, the number and size of video benchmarks have also been growing, including UCF101 (UCF), Kinetics-700 (DeepMind), and YouTube-8M (Google AI), which has accelerated progress in video content understanding for many real-world applications. Nevertheless, many techniques for video representation and video classification still face a series of challenges.

Training Dataset

YouTube-8M is a large-scale benchmark for general multi-label video classification. It consists of about 6.1M videos from YouTube.com, each with at least 1000 views and a duration of 120 to 300 seconds, and each labeled with one or more tags (labels) from a vocabulary of 3862 visual entities. https://research.google.com/youtube8m/explore.html
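Because each video can carry several of the 3862 labels at once, a common way to represent the training target is a multi-hot vector. The snippet below is a minimal sketch of that encoding; the function name `multi_hot` and the use of plain Python lists are illustrative assumptions, not part of this repository.

```python
NUM_CLASSES = 3862  # vocabulary size stated in the dataset description


def multi_hot(label_ids, num_classes=NUM_CLASSES):
    """Return a 0/1 list with ones at the given label indices.

    `multi_hot` is a hypothetical helper used for illustration only.
    """
    target = [0.0] * num_classes
    for idx in label_ids:
        if not 0 <= idx < num_classes:
            raise ValueError(f"label id {idx} is outside the vocabulary")
        target[idx] = 1.0
    return target
```

A video tagged with labels 0, 5, and 3861 would then map to a vector of length 3862 with exactly three entries set to 1.0.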

Video Feature Vectorization

Decode each video into N frames at one frame per second (N ranges from 1 to 360, i.e. at most 6 minutes).
Feed the decoded frames into Inception V3, a CNN-based network, and take the ReLU activation of the last hidden layer, just before the classification layer.
Apply whitening, PCA, and batch normalization to reduce the feature dimensions to 1024, 4, followed by quantization. PCA reduces the dimensionality of the data; whitening makes the input less redundant.
Apply the NetVLAD aggregation network.
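The PCA, whitening, and quantization steps above can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the code used to produce the released features: the helper names, the `eps` stabilizer, and the clipping range of [-2, 2] for 8-bit quantization are all assumptions made for the example.

```python
import numpy as np


def fit_pca_whitening(features, out_dim, eps=1e-5):
    """Fit a PCA + whitening projection on an (N, D) feature matrix.

    Returns the feature mean and a (D, out_dim) projection matrix.
    `eps` is an assumed stabilizer to avoid dividing by tiny eigenvalues.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:out_dim]      # keep the top out_dim components
    scale = 1.0 / np.sqrt(eigvals[order] + eps)      # whitening: unit variance per axis
    return mean, eigvecs[:, order] * scale


def quantize(x, lo=-2.0, hi=2.0):
    """Clip to [lo, hi] (assumed range) and map linearly to uint8."""
    clipped = np.clip(x, lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255).astype(np.uint8)


def dequantize(q, lo=-2.0, hi=2.0):
    """Approximate inverse of `quantize`."""
    return q.astype(np.float32) * (hi - lo) / 255 + lo
```

Projecting with `(features - mean) @ projection` yields decorrelated, unit-variance components, which is the sense in which whitening makes the input less redundant; quantization then stores each coefficient in one byte at the cost of a small rounding error.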

Audio Feature Extraction

Use the VGGish model to convert audio input features into a semantically meaningful, high-level 128-D embedding that can be fed to a downstream classification model.
Apply PCA (+ whitening) to reduce the feature dimensions to 1024, 4, followed by quantization.
Apply the NetVLAD aggregation network. This step is the same as in video feature vectorization; here the video-level descriptor V(j,k) is 128*128.
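To make the shape of the descriptor concrete, here is a minimal NumPy sketch of NetVLAD-style aggregation, where V(j,k) sums the soft-assigned residuals of every frame feature against cluster center k. It assumes 128 clusters over 128-D features to match the 128*128 size mentioned above. The parameters would normally be learned; the function signature and the normalization details are assumptions for illustration, not this repository's implementation.

```python
import numpy as np


def netvlad(frames, centers, weights, biases):
    """Aggregate frame-level features into one video-level descriptor.

    frames:  (N, D) per-frame (or per-second audio) features
    centers: (K, D) cluster centers c_k
    weights: (D, K) and biases: (K,) soft-assignment parameters
    Returns V with shape (D, K), where
        V[j, k] = sum_i a_k(x_i) * (x_i[j] - c_k[j]).
    """
    logits = frames @ weights + biases                    # (N, K)
    logits -= logits.max(axis=1, keepdims=True)           # numerically stable softmax
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)           # soft assignment a_k(x_i)
    # Sum of weighted features minus weighted centers gives the residual sum.
    vlad = frames.T @ assign - centers.T * assign.sum(axis=0)   # (D, K)
    # Intra-normalization: L2-normalize each cluster column.
    vlad /= np.linalg.norm(vlad, axis=0, keepdims=True) + 1e-12
    # Final L2 normalization over the whole descriptor.
    vlad /= np.linalg.norm(vlad) + 1e-12
    return vlad
```

With D = 128 and K = 128, the output has 128*128 entries regardless of how many frames N the video contains, which is what lets variable-length videos feed a fixed-size classifier.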

Model Architecture (to be added)

Feature Concatenation

To be added.

Loss Confusion

To be added.

Dependencies

TensorFlow >= 1.4

History

Jan 15, 2020: Basic Model

Usage Instructions

Paper: to be added

Download Dataset

curl data.yt8m.org/download.py | partition=2/frame/train mirror=us python

curl data.yt8m.org/download.py | partition=2/frame/validate mirror=us python

curl data.yt8m.org/download.py | partition=2/frame/test mirror=us python

Training Model

python offline_train.py

banxia1994/NextVLAD-Attention-Model