This is a team project between Sarthak Agrawal and me during our internship at Innomatics Research Labs.
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in connecting users with relevant information. For a company like Google, for example, providing a seamless and accurate search experience is paramount. This project focuses on improving search relevance for video subtitles, thereby enhancing the accessibility of video content.
Finding relevant video content through traditional keyword-based search methods is difficult when the search target is subtitle text. Currently, most search engines rely on keywords within video titles, descriptions, or closed captions. This approach, however, is not well suited to finding specific content within a video based on dialogue or spoken information.
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results by building a semantic search engine.
The database provided contains a sample of 82,498 subtitle files from opensubtitles.org. Most of the subtitles are for movies and TV series released after 1990 and before 2024.
Database File Name: eng_subtitles_database.db
The database contains a table called 'zipfiles' with three columns (a sketch of reading and decoding these rows follows the list):
- num: Unique Subtitle ID reference for www.opensubtitles.org
- name: Subtitle File Name
- content: Subtitle files were compressed and stored as a binary using 'latin-1' encoding.
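To make the data format concrete, below is a minimal sketch of how such a database could be read, assuming the file is a SQLite database and that each 'content' blob is a zip archive holding a single .srt file. Names like `load_subtitles` are illustrative and not the project's actual code:

```python
import sqlite3
import zipfile
import io

def load_subtitles(db_path="eng_subtitles_database.db", limit=5):
    """Read rows from the 'zipfiles' table and return decoded subtitle text."""
    conn = sqlite3.connect(db_path)
    cursor = conn.execute("SELECT num, name, content FROM zipfiles LIMIT ?", (limit,))
    subtitles = []
    for num, name, blob in cursor:
        # Each blob is assumed to be a zip archive; extract the first file inside it
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            raw = zf.read(zf.namelist()[0])
        # Decode the raw subtitle bytes using latin-1, as described above
        subtitles.append({"num": num, "name": name, "text": raw.decode("latin-1")})
    conn.close()
    return subtitles
```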
Below is an outline of the steps taken to meet the project's objective; a condensed sketch of the core pipeline follows the list:
- Reading the Data from the Database – decompressing and decoding
- Data cleaning
- Data chunking
- Generating text embeddings
- Storing data in a vector database (vector stores)
- Building a frontend application to access the search engine
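As a rough illustration of the cleaning, chunking, embedding, and storage steps, the sketch below applies light cleaning, simple fixed-size chunking with overlap (the project itself used semantic chunking), embedding with a sentence-transformer model, and storage in a ChromaDB collection. The model name, chunk sizes, and collection name are illustrative assumptions rather than the project's exact settings:

```python
import re
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # assumed embedding model
client = chromadb.PersistentClient(path="./chroma_store")  # local vector store
collection = client.get_or_create_collection("subtitle_chunks")

def clean_srt(text):
    """Light cleaning: drop sequence numbers, timestamps, and HTML tags."""
    text = re.sub(r"\d+\r?\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text, size=500, overlap=100):
    """Split the cleaned text into overlapping character windows."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def index_subtitle(sub):
    """Embed each chunk and add it to the ChromaDB collection."""
    chunks = chunk(clean_srt(sub["text"]))
    if not chunks:
        return
    collection.add(
        ids=[f'{sub["num"]}_{i}' for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),
        metadatas=[{"num": sub["num"], "name": sub["name"]}] * len(chunks),
    )
```

Indexing would then be a matter of looping over the decoded subtitles, e.g. `for sub in load_subtitles(limit=1000): index_subtitle(sub)`.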
Find the comprehensive documentation of these steps here
In the course of this project, the following tools and technologies were utilized:
- Python
- ChromaDB
- Natural Language Processing
- Streamlit
- Google Colab
You can also find the project report here.
Below is the final output of our project - an interactive web application.
The figure below shows the redirection to the OpenSubtitles.org page
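For context, a Streamlit frontend along these lines could embed the user's query, retrieve the nearest chunks from ChromaDB, and link each hit back to its opensubtitles.org page via the 'num' ID. The widget labels and URL pattern below are illustrative assumptions, not necessarily what the app uses:

```python
import streamlit as st
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./chroma_store").get_collection("subtitle_chunks")

st.title("Subtitle Search Engine")
query = st.text_input("Describe the scene or dialogue you are looking for")

if query:
    # Embed the query and retrieve the most similar subtitle chunks
    results = collection.query(query_embeddings=model.encode([query]).tolist(), n_results=5)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        # Each hit links back to its source page on opensubtitles.org (URL pattern assumed)
        st.markdown(f'[{meta["name"]}](https://www.opensubtitles.org/en/subtitles/{meta["num"]})')
        st.write(doc)
```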
In this project we built an MVP (Minimum Viable Product) that bridges the gap between a user’s search intent and the content of video subtitles. We did this by applying light cleaning to preserve contextual meaning and semantic chunking of the subtitle files to mitigate information loss, and by creating an interactive web app to test our solution. This is a glimpse of what can be done when building a semantic search engine for video subtitles, and we encourage exploring the concept further to build a more robust search engine.