Introduction

This repository performs image captioning for a single image, using a Transformer architecture together with a pretrained VGG16 model.


Getting started

In this source code, I use the self-attention mechanism to build my own Transformer, and I use VGG16 to extract features from the images before passing them to the encoder component of the Transformer.
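As a rough illustration of the self-attention step mentioned above, here is a minimal, dependency-free sketch of single-head scaled dot-product self-attention. It is not the code from this repository: the function names (`self_attention`, `matmul`, `softmax`) and the tiny hand-written weight matrices are assumptions made for the example, and a real model would use a tensor library and learned weights.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is a (sequence_length x d_model) matrix; w_q, w_k, w_v are
    (d_model x d_k) projection matrices (hypothetical values here).
    Returns the attended output and the attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d_k), then normalise each query's scores into weights.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


if __name__ == "__main__":
    # Three "tokens" (for example, three image feature vectors) of size 2.
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    output, weights = self_attention(x, identity, identity, identity)
    print(output)
    print(weights)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors. In the captioning setting, the rows of `x` would be the features extracted by VGG16.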

  • Crawling dataset

I use the above website and the Selenium library for Python to crawl the images and their titles: Crawling dataset
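The crawling step could look roughly like the sketch below. It is only an assumption about how such a crawler might be written: the page layout of the target website is not described here, so reading each title from the image's `alt` attribute, and the names `crawl_images_and_titles` and `pair_images_with_titles`, are hypothetical. The URL must be supplied by the caller.

```python
def pair_images_with_titles(srcs, titles):
    """Keep only (src, title) pairs where both values are non-empty."""
    pairs = []
    for src, title in zip(srcs, titles):
        if src and title and title.strip():
            pairs.append((src, title.strip()))
    return pairs


def crawl_images_and_titles(url):
    """Open `url` in Chrome and collect image URLs with their titles.

    Assumption: each title is stored in the <img> element's alt text.
    A real site may keep titles elsewhere and need different locators.
    """
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        images = driver.find_elements(By.TAG_NAME, "img")
        srcs = [img.get_attribute("src") for img in images]
        titles = [img.get_attribute("alt") for img in images]
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
    return pair_images_with_titles(srcs, titles)
```

Filtering out entries with a missing image URL or an empty title keeps the resulting dataset usable as (image, caption) pairs for training.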

Model training and inference are performed here: Training model

Results

The loss and accuracy achieved on the validation dataset are not good. However, the caption generated in this case is not bad.