
Encoder-Decoder-based Video Captioning

This repository provides an Encoder-Decoder Sequence-to-Sequence model that generates captions for input videos. A pre-trained VGG16 model is used to extract features for every frame of the video.
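The repository's Extract_Features_Using_VGG.py performs the frame-feature step; the snippet below is only a minimal sketch of how per-frame VGG16 features can be extracted with Keras and OpenCV. The frame count, the fc2 feature layer, and the uniform sampling are illustrative assumptions, not necessarily the repository's exact settings.

```python
# Sketch: extract a fixed number of frame features with a pre-trained VGG16.
# The frame count (80) and the fc2 feature layer are assumptions for illustration.
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Use the 4096-d fc2 activations as the per-frame feature vector.
base = VGG16(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_video_features(video_path, num_frames=80):
    """Sample `num_frames` frames uniformly and return a (num_frames, 4096) array."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))           # VGG16 input size
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        frames.append(frame)
    cap.release()
    batch = preprocess_input(np.array(frames, dtype=np.float32))
    return feature_extractor.predict(batch, verbose=0)  # shape: (num_frames, 4096)
```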

Video captioning is important because it can be applied to numerous tasks. For example, it can help search videos across web pages efficiently, and it can be used to cluster videos whose generated captions are highly similar.

Requirements

  • TensorFlow
  • Keras
  • OpenCV
  • NumPy
  • functools

Usage

Data

  • The MSVD dataset developed by Microsoft can be downloaded from here.
  • The dataset contains 1450 short YouTube clips that have been manually labeled for training and 100 videos for testing.
  • Each video has been assigned a unique ID and each ID has about 15–20 captions.

Training and Testing

  • To extract features for the frames of every input video using the pre-trained VGG16 model, run Extract_Features_Using_VGG.py.
  • To train the developed model, run training_model.py (a minimal sketch of the encoder-decoder architecture is shown after this list).
  • To use the trained Video Captioning model for inference, run predict_model.py.
  • To use the trained model for real-time video caption generation, run Video_Captioning.py.
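For reference, the following is a minimal sketch of what an LSTM-based encoder-decoder captioning model might look like in Keras. The layer sizes, vocabulary size, and caption length are illustrative assumptions and not necessarily what training_model.py uses.

```python
# Sketch of an LSTM encoder-decoder for video captioning.
# All dimensions (80 frames, 4096-d features, vocab size, caption length) are assumptions.
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.models import Model

NUM_FRAMES, FEAT_DIM = 80, 4096
VOCAB_SIZE, MAX_CAPTION_LEN, LATENT_DIM = 1500, 10, 512

# Encoder: consumes the sequence of VGG16 frame features and keeps its final state.
encoder_inputs = Input(shape=(NUM_FRAMES, FEAT_DIM))
_, state_h, state_c = LSTM(LATENT_DIM, return_state=True)(encoder_inputs)

# Decoder: generates the caption word by word, initialised with the encoder state.
decoder_inputs = Input(shape=(MAX_CAPTION_LEN,))
decoder_embedding = Embedding(VOCAB_SIZE, LATENT_DIM, mask_zero=True)(decoder_inputs)
decoder_outputs = LSTM(LATENT_DIM, return_sequences=True)(
    decoder_embedding, initial_state=[state_h, state_c]
)
decoder_outputs = Dense(VOCAB_SIZE, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

At inference time, the decoder would typically be run one step at a time, feeding back the previously predicted word (e.g. greedy decoding) until an end-of-sentence token is produced.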

Results

Below are a few results of the developed video captioning approach on test videos:

(Test video frames omitted.)

  • a woman is mixing some food
  • a man is performing on a stage
  • a man is mixing ingredients in a bowl
  • a man is spreading a tortilla
  • a woman is seasoning some food
  • a cat is playing the piano