My project for the NextGen GPT AI Hackathon. Check here for more info.
There is a large body of images and videos that have been clustered via tags, usually written by a human. Because of this, a lot of information is lost. It also leads to video streaming and sharing sites holding a plethora of content that has never been seen.
Even solutions that can segment and identify entities in images are limited: they need to be fine-tuned in order to be verbose, for example to identify objects in relation to other information in a given dataset. This in itself does not scale to new information.
A multimodal Large Language Model can be used to generate a set of tags for images and videos. These tags can be extremely specific, letting users find exactly what they want. In addition, it is easier to search bodies of text than bodies of images, and each set of tags points back to its image.
For this hackathon I used the Pexels-400k dataset, located on Hugging Face.
From there I took random samples and prompted Gemini to create a set of tags (a Set[str]). I then generated .txt files in which the URL of the image is stored on the first line and the remaining lines contain the generated tags.
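The per-image .txt format described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the URL and tags are hypothetical placeholders standing in for a Pexels image URL and Gemini's output.

```python
from pathlib import Path

def write_tag_file(path: Path, url: str, tags: set[str]) -> None:
    """Store the image URL on the first line, then one generated tag per line."""
    path.write_text("\n".join([url] + sorted(tags)) + "\n", encoding="utf-8")

def read_tag_file(path: Path) -> tuple[str, set[str]]:
    """Recover the URL and tag set from a stored .txt file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return lines[0], set(lines[1:])

# Hypothetical example: tags as a vision model might return for a beach photo.
out = Path("example_tags.txt")
write_tag_file(out, "https://images.pexels.com/photos/12345/example.jpg",
               {"beach", "sunset", "golden hour", "ocean waves"})
url, tags = read_tag_file(out)
```

Keeping the URL on a fixed line makes the files trivial to parse back when building the search index.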
This directory is then vectorized and called upon for retrieval-augmented generation (RAG).
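The retrieval step works because tags are text: a query can be matched against each image's tag set. The project's actual embedding method is not specified here, so the sketch below uses a simple stdlib bag-of-words vectorization with cosine similarity over a hypothetical mini-index; a real RAG setup would substitute dense embeddings.

```python
import math
from collections import Counter

def vectorize(tags: list[str]) -> Counter:
    # Bag-of-words over the tokens of each tag string.
    return Counter(tok for tag in tags for tok in tag.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-index: image URL -> generated tags.
index = {
    "https://example.com/beach.jpg": ["sandy beach", "sunset", "ocean"],
    "https://example.com/city.jpg": ["city skyline", "night", "skyscraper"],
}
vectors = {url: vectorize(tags) for url, tags in index.items()}

def search(query: str) -> str:
    """Return the URL whose tag vector is most similar to the query."""
    q = vectorize([query])
    return max(vectors, key=lambda url: cosine(q, vectors[url]))

best = search("sunset over the ocean")
```

A query like "sunset over the ocean" resolves to the beach image because its tags share the most tokens with the query.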
I believe that as Large Language Models with a vision modality become cheaper, categorizing images and videos will become easy. Even now, this solution is cheaper than hiring someone to do similar work, and it can lead to more satisfying search results.
Instructions for how to contribute to the project.