My project for the NextGen GPT AI Hackathon. Check here for more info.
There is a large body of images and videos that have been clustered via tags, usually written by a human. Because of this, a lot of information is lost. It also leads to video streaming and sharing sites holding a plethora of content that has never been seen.
Even solutions that can segment and identify entities in images are limited: they need to be fine-tuned in order to be verbose, for example to identify objects in relation to other information in a given dataset. This in itself does not scale to new information.
A multimodal Large Language Model can be used to generate a set of tags for images and videos. These tags can be extremely specific, letting users find exactly what they want. In addition, it is easier to search bodies of text than bodies of images, and each set of tags points back to its image.
For this hackathon I used the Pexels-400k dataset, located on Hugging Face.
From there I took random samples and prompted Gemini to create a set of tags (a Set[str]). I then generated .txt files in which the URL of the image is stored on the first line and the remaining lines contain the generated tags.
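The per-image .txt format described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the URL and tags are hypothetical placeholders standing in for a Pexels image URL and Gemini's output.

```python
from pathlib import Path

def write_tag_file(path: Path, url: str, tags: set[str]) -> None:
    """Store the image URL on the first line, then one generated tag per line."""
    path.write_text("\n".join([url] + sorted(tags)) + "\n", encoding="utf-8")

def read_tag_file(path: Path) -> tuple[str, set[str]]:
    """Recover the URL and tag set from a stored .txt file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return lines[0], set(lines[1:])

# Hypothetical example: tags as a vision model might return for a beach photo.
out = Path("example_tags.txt")
write_tag_file(out, "https://images.pexels.com/photos/12345/example.jpg",
               {"beach", "sunset", "golden hour", "ocean waves"})
url, tags = read_tag_file(out)
```

Keeping the URL on a fixed line makes the files trivial to parse back when building the search index.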
This directory is then vectorized and called upon for retrieval-augmented generation (RAG).
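The retrieval step works because tags are text: a query can be matched against each image's tag set. The project's actual embedding method is not specified here, so the sketch below uses a simple stdlib bag-of-words vectorization with cosine similarity over a hypothetical mini-index; a real RAG setup would substitute dense embeddings.

```python
import math
from collections import Counter

def vectorize(tags: list[str]) -> Counter:
    # Bag-of-words over the tokens of each tag string.
    return Counter(tok for tag in tags for tok in tag.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-index: image URL -> generated tags.
index = {
    "https://example.com/beach.jpg": ["sandy beach", "sunset", "ocean"],
    "https://example.com/city.jpg": ["city skyline", "night", "skyscraper"],
}
vectors = {url: vectorize(tags) for url, tags in index.items()}

def search(query: str) -> str:
    """Return the URL whose tag vector is most similar to the query."""
    q = vectorize([query])
    return max(vectors, key=lambda url: cosine(q, vectors[url]))

best = search("sunset over the ocean")
```

A query like "sunset over the ocean" resolves to the beach image because its tags share the most tokens with the query.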
I believe that as Large Language Models with a vision modality become cheaper, categorizing images and videos will become easy. Even now, this solution is cheaper than hiring someone to do similar work, and it can lead to more satisfying search results.
Instructions for how to contribute to the project.