Video-Captioning

Video captioning is a sequence-to-sequence learning task tackled here with an encoder-decoder model. The model accepts a video as input and produces a descriptive caption that summarizes the content of the video.

Captioning matters because it makes videos more accessible in several ways: an automated video caption generator improves the searchability of videos on websites and makes it easier to group videos by their content.

Inspiration

While exploring new projects, I discovered video captioning and noticed the scarcity of reliable resources available. With this project, I aim to simplify the implementation of video captioning, making it more accessible for individuals interested in this field.

Dataset

This project utilizes the MSVD dataset, which consists of 1450 training videos and 100 testing videos, to facilitate the development of video captioning models.

Setup

Clone the repository: git clone https://github.com/MathurUtkarsh/Video-Captioning-Using-LSTM-and-Keras.git

Enter the project directory: cd Video-Captioning

Create environment: conda create -n video_caption python=3.7

Activate environment: conda activate video_caption

Install requirements: pip install -r requirements.txt

Usage

To utilize the pre-trained models, follow these steps:

  1. Add a video to the "data/testing_data/video" folder.
  2. Execute the "predict_realtime.py" file using the command: python predict_realtime.py.

For quicker results, extract the features of the video and save them in the "feat" folder within the "testing_data" directory.

To convert the video into features, run the "extract_features.py" file using the command: python extract_features.py.

For local training, run the "train.py" file. Alternatively, you can use the "Video_Captioning.ipynb" notebook.

Model

Training Architecture

Encoder Model

Decoder Model
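
The architecture diagrams above (training graph, encoder, decoder) boil down to a standard LSTM encoder-decoder. Below is a minimal Keras sketch of the training graph, assuming 80 frames of 4096-dimensional VGG16 features per video; the latent dimension, vocabulary size, and maximum caption length are placeholder values and may differ from the trained model.

```python
# Minimal sketch of the LSTM encoder-decoder training graph (Keras).
# The 80 x 4096 input shape comes from the VGG16 features described in this README;
# LATENT_DIM, VOCAB_SIZE and MAX_CAPTION_LEN are assumed values, not the exact settings.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

NUM_FRAMES, FEATURE_DIM = 80, 4096   # per-video VGG16 features (from this README)
LATENT_DIM = 512                     # assumption
VOCAB_SIZE = 1500                    # assumption
MAX_CAPTION_LEN = 10                 # assumption

# Encoder: consumes the sequence of frame features and keeps only its final states.
encoder_inputs = Input(shape=(NUM_FRAMES, FEATURE_DIM), name="encoder_inputs")
_, state_h, state_c = LSTM(LATENT_DIM, return_state=True, name="encoder_lstm")(encoder_inputs)

# Decoder: teacher-forced with one-hot caption tokens, initialised with the encoder states.
decoder_inputs = Input(shape=(MAX_CAPTION_LEN, VOCAB_SIZE), name="decoder_inputs")
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True, name="decoder_lstm")
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(VOCAB_SIZE, activation="softmax", name="decoder_dense")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
```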

Loss

This is the graph of epochs vs loss. The loss used here is categorical crossentropy.

Metric

This is the graph of epochs vs metric. The metric used here is accuracy.
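
Continuing the sketch above, compiling with this loss and metric might look like the following. The optimizer, batch size, and epoch count are assumptions rather than the exact training settings, and the dummy arrays only illustrate the expected input shapes.

```python
import numpy as np

# Continuing from the architecture sketch above; dummy data is used only to show shapes.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

n = 8  # tiny dummy batch
video_feats = np.random.rand(n, NUM_FRAMES, FEATURE_DIM).astype("float32")
caption_in  = np.zeros((n, MAX_CAPTION_LEN, VOCAB_SIZE), dtype="float32")  # teacher-forcing inputs
caption_out = np.zeros((n, MAX_CAPTION_LEN, VOCAB_SIZE), dtype="float32")  # one-step-shifted targets

history = model.fit([video_feats, caption_in], caption_out, batch_size=4, epochs=2)
# history.history["loss"] and history.history["accuracy"] are the curves plotted above.
```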

Features

  • Realtime implementation
  • Two decoding algorithms to choose from, depending on the requirements: beam search and greedy search

The greedy search algorithm chooses the word with the highest probability at each step of generating the output sequence. In contrast, the beam search algorithm keeps the top-k candidate sequences at each time step, ranked by their cumulative conditional probabilities, and returns the most probable complete sequence.

For a more detailed understanding of these search algorithms, you can refer to this informative blog post.
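
As a rough illustration of the difference, here is a self-contained sketch of both decoders over a dummy per-step probability function. The toy vocabulary, next_token_probs, and the beam width are stand-ins for demonstration and are not part of this repository's code.

```python
import numpy as np

# Illustrative greedy vs. beam decoding over a hypothetical step function.
# next_token_probs(prefix) stands in for one decoder step conditioned on the
# video features and the tokens generated so far.
VOCAB = ["<bos>", "<eos>", "a", "man", "woman", "is", "riding", "cooking", "bicycle"]

def next_token_probs(prefix):
    # Dummy distribution, seeded by the prefix length, for demonstration only.
    rng = np.random.default_rng(len(prefix))
    p = rng.random(len(VOCAB))
    return p / p.sum()

def greedy_decode(max_len=10):
    seq = ["<bos>"]
    for _ in range(max_len):
        probs = next_token_probs(seq)
        word = VOCAB[int(np.argmax(probs))]   # always take the single most probable word
        seq.append(word)
        if word == "<eos>":
            break
    return seq[1:]

def beam_decode(beam_width=3, max_len=10):
    beams = [(["<bos>"], 0.0)]                # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))
                continue
            probs = next_token_probs(seq)
            for i, p in enumerate(probs):
                candidates.append((seq + [VOCAB[i]], score + np.log(p + 1e-12)))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0][1:]

print("greedy:", greedy_decode())
print("beam  :", beam_decode())
```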

Performance of both algorithms on testing data

| Beam Search Caption (time taken) | Greedy Search Caption (time taken) |
| --- | --- |
| a woman is seasoning some food (22.05s) | a woman is seasoning some food (0.70s) |
| a man is singing (13.79s) | a man is performing on a stage (0.77s) |
| the animal is sitting on the ground (21.16s) | a person is playing (0.67s) |
| a man is riding a bicycle (22.20s) | a man is riding a bicycle (0.66s) |
| a man is spreading a tortilla (25.65s) | a man is spreading a tortilla (0.75s) |
| a woman is mixing some food (35.91s) | a woman is mixing some food (0.72s) |
| a dog is dancing (15.58s) | a dog is making a dance (0.68s) |
| a person is cutting a pineapple (24.31s) | a person is cutting a piece of pieces (0.70s) |
| a cat is playing the piano (26.48s) | a cat is playing the piano (0.70s) |
| a man is mixing ingredients in a bowl (38.16s) | a man is mixing ingredients in a bowl (0.69s) |

Scripts

  • train.py contains the model architecture
  • predict_test.py generates predictions for the test videos and stores them in a txt file along with the time taken for each prediction
  • predict_realtime.py generates captions in realtime
  • model_final folder contains the trained encoder model along with the tokenizer and decoder model weights.
  • features.py extracts 80 evenly spaced frames from each video and passes them through a pre-trained VGG16, so each frame is represented by a 4096-dimensional vector; for each video this produces a numpy array of shape (80, 4096). A sketch of this step follows this list.
  • config.py contains all the configurations I am using
  • Video_Captioning.ipynb is the notebook I used for training and building this project.
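
Since the feature-extraction step is the part most people need to adapt, here is a minimal sketch of what features.py is described as doing: sample 80 evenly spaced frames, run them through a pre-trained VGG16, and save a (80, 4096) numpy array. The function name, the choice of the fc2 layer, and the example file paths are assumptions on my part.

```python
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Sketch of the per-video feature extraction described above:
# 80 evenly spaced frames -> VGG16 fc2 activations -> array of shape (80, 4096).
base = VGG16(weights="imagenet")
feature_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)  # 4096-dim output

def video_to_features(video_path, num_frames=80):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            frame = np.zeros((224, 224, 3), dtype=np.uint8)   # pad if a read fails
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)         # VGG16 preprocess expects RGB
        frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    batch = preprocess_input(np.array(frames, dtype="float32"))
    return feature_model.predict(batch, verbose=0)              # shape: (80, 4096)

# Example usage (file name is hypothetical):
# features = video_to_features("data/testing_data/video/example.mp4")
# np.save("data/testing_data/feat/example.mp4.npy", features)
```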

Future Development

  • Integrating attention blocks and pretrained embeddings (e.g., GloVe) to enhance the model's comprehension of sentences.
  • Exploring the use of other pretrained models, such as I3D, specifically designed for video understanding, to improve feature extraction.
  • Expanding the model's capability to handle longer videos, as it currently supports only 80 frames.
  • Incorporating a user interface (UI) into the project for a more user-friendly experience.
  • Using the ChatGPT API provided by OpenAI to generate more creative and compelling captions.

References

S2VT paper (2015)

Keras implementation

Intelligent-Projects-Using-Python