Intelligent System for generating captions from uploaded images
-
Dataset
COCO - Common Objects in Context (Microsoft)
https://cocodataset.org/#home -
DL Model
Vision Encoder-Decoder Model (ViT + GPT-2)
Link: https://huggingface.co/docs/transformers/v4.29.1/en/model_doc/vision-encoder-decoder#transformers.VisionEncoderDecoderModel
Description:
The Vision Encoder-Decoder Model can be used when the system takes an image as input and generates text as output:
IMAGE ==> TENSOR EMBEDDING ==> TEXT
Step 01: Pretrained transformer-based vision model ==> this is the encoder (ViT)
takes the IMAGE ==> TENSOR EMBEDDING
Step 02: Pretrained language model ==> this is the decoder (GPT-2)
takes the TENSOR EMBEDDING ==> TEXT
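The two steps above can be sketched in plain PyTorch to make the tensor shapes concrete. This is a minimal illustration, not the real model: the patch-embedding layer stands in for the full ViT encoder, and a single linear head stands in for the GPT-2 decoder (which in the real model cross-attends to the embedding). The shapes assume ViT-Base (224x224 input, 16x16 patches, 768-dim embeddings) and the GPT-2 vocabulary.

```python
import torch
import torch.nn as nn

# Step 01 (sketch): ViT encoder -- a 224x224 RGB image is split into
# 16x16 patches, each projected to a 768-dim embedding, and a [CLS]
# token is prepended. A strided Conv2d implements the patch projection.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
image = torch.randn(1, 3, 224, 224)            # IMAGE
patches = patch_embed(image)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)    # (1, 196, 768)
cls = torch.zeros(1, 1, 768)
embedding = torch.cat([cls, tokens], dim=1)    # TENSOR EMBEDDING: (1, 197, 768)

# Step 02 (sketch): GPT-2 decoder -- consumes the embedding and emits
# text token ids. Here a linear head over the GPT-2 vocabulary stands
# in for the whole decoder.
vocab_size = 50257                             # GPT-2 vocabulary size
lm_head = nn.Linear(768, vocab_size)
logits = lm_head(embedding)                    # (1, 197, 50257)
next_token = logits[:, -1].argmax(-1)          # greedy pick of a TEXT token id
```

In the real app the two stands-ins are replaced by the pretrained ViT and GPT-2 checkpoints combined through the VisionEncoderDecoderModel class linked above.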
The following technologies were used to develop CaptionGeneratorApp:
- Frontend: HTML + CSS + jQuery
- Backend: Python + Django
- Database: PostgreSQL
- ML-modeling: PyTorch
To start the development server, run: $ python manage.py runserver
Templates:
- home.html
- report.html