
Image_Captioning_Tutorial_using_Transformer_Pytorch

A new model that achieves better performance at "low cost". I also hope this serves as an image captioning tutorial, since the steps are kept really simple. NO IMAGE DETECTION required.

getting started

This repo is based on many existing image captioning models.

I developed the simplest model while keeping strong performance.

/

coco_dataset: data preparation (karpathy split)

  • The COCO dataset consists of 80k train images, 40k validation images, and 40k test images. I did not use the test data here; I trained on the 80k train images and validated on the 40k validation images.

download the images here: 'train_coco_images2014', 'valid_coco_images2014', 'test_coco_images2014'
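For reference, a minimal sketch of reading the Karpathy split, assuming the standard dataset_coco.json annotation layout (the file name and field names are assumptions, not necessarily what this repo uses):

```python
import json
from collections import defaultdict

# Minimal sketch, assuming the standard Karpathy-split annotation file.
with open('dataset_coco.json') as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data['images']:
    # each image entry carries its split ('train', 'restval', 'val', 'test')
    # and five reference sentences
    captions = [' '.join(s['tokens']) for s in img['sentences']]
    splits[img['split']].append((img['filename'], captions))

print({k: len(v) for k, v in splits.items()})
```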

/

vocab

As the vocabulary for the embedding layer, I tried using gpt2 (50,257 tokens) and BERT (30,522 tokens), but these required a relatively large amount of computation and made training slow, so I created a separate vocab_dict. (See vocab.py for this.)

I selected frequently used words from the COCO annotation data and encoded the captions with them. (I selected 15,000 tokens.)
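A minimal sketch of how such a frequency-based vocab_dict could be built (the names are illustrative; see vocab.py for the actual logic):

```python
from collections import Counter

def build_vocab(captions, vocab_size=15000):
    # count word frequencies over all COCO annotation captions
    counter = Counter(w for cap in captions for w in cap.lower().split())
    specials = ['<pad>', '<start>', '<end>', '<unk>']
    words = [w for w, _ in counter.most_common(vocab_size - len(specials))]
    return {tok: i for i, tok in enumerate(specials + words)}

def encode(caption, vocab_dict):
    # map words outside the 15k vocabulary to <unk>
    unk = vocab_dict['<unk>']
    return [vocab_dict.get(w, unk) for w in caption.lower().split()]
```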

** After a number of later experiments, the pretrained gpt2 embedding layer performed best. (Check model.py.)

/

encoder : CLIP

I used CLIP as the encoder. At the beginning of training, the encoder (ResNet) parameters were not trainable; later, re-training the trained caption model with the encoder parameters included improved performance (fine-tuning).
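A minimal sketch of that two-stage schedule (the encoder/decoder attribute names and learning rates are placeholders, not this repo's API):

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(512, 768)  # stand-in for the CLIP encoder
        self.decoder = nn.Linear(768, 768)  # stand-in for the gpt2 decoder

model = CaptionModel()

# Stage 1: freeze the encoder, train only the decoder.
for p in model.encoder.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# Stage 2: once the caption model is trained, unfreeze the encoder and
# re-train everything at a smaller learning rate (fine-tuning).
for p in model.encoder.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
```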

/

decoder : gpt2 --> base_model.py

  • The decoder structure is the simplest possible, but I used one trick: the image input is projected into the gpt2 hidden space as its own token. This means one image token, along with 20 word tokens, is fed to gpt2 as a (N, 21, 768) sequence.

  • Of course, there is no label for the image token, so the loss function only covers the 20 word positions of the (N, 21, 768) output (see the sketch below).
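A minimal sketch of this trick, one plausible reading of the description above (the projection size and variable names are assumptions; see base_model.py for the real code):

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')
img_proj = nn.Linear(512, 768)  # assumed CLIP feature size -> gpt2 hidden size

N = 4
clip_feats = torch.randn(N, 512)               # precomputed CLIP image features
input_ids = torch.randint(0, 50257, (N, 20))   # 20 word tokens per caption

img_token = img_proj(clip_feats).unsqueeze(1)            # (N, 1, 768)
word_emb = gpt2.transformer.wte(input_ids)               # (N, 20, 768)
inputs_embeds = torch.cat([img_token, word_emb], dim=1)  # (N, 21, 768)

logits = gpt2(inputs_embeds=inputs_embeds).logits        # (N, 21, vocab)
# the image position predicts the first word and each word position predicts
# the next one; the final position is dropped, so exactly the 20 word tokens
# serve as labels and the image token itself needs none.
shift_logits = logits[:, :-1, :]                         # (N, 20, vocab)
loss = nn.functional.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    input_ids.reshape(-1))
```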

/

research

To achieve good performance, modern image captioning models use image detection by default. However, this makes them difficult to implement for users with limited GPU resources. Therefore, I made various attempts to obtain a good model with less GPU.

  1. tagging model
  • The image captioning model takes word anchors (tags) as input in addition to the image. I want to conduct a separate training for the tags using various models such as CNN, LSTM, etc. (see the sketch after this list).

  • example) model_input : '[dog] [bark]', 'INPUT_IDS', 'IMAGE' (where <[dog] [bark]> corresponds to the tags.)

  2. another attempt
  • I wanted to treat the image captioning task as text -> text rather than image -> text. I tried a training process that generates arbitrary text and then uses the image to refine it into the correct caption; this is currently in progress.
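A hypothetical sketch of how the tag input from idea 1 could be assembled (the function and field names are made up for illustration):

```python
def make_tagged_input(tags, caption_ids, image_feature, vocab_dict):
    # tags come from a separate tagging model, e.g. ['dog', 'bark']
    tag_prefix = ' '.join(f'[{t}]' for t in tags)   # -> '[dog] [bark]'
    unk = vocab_dict.get('<unk>', 0)
    tag_ids = [vocab_dict.get(w, unk) for w in tag_prefix.split()]
    return {'tag_ids': tag_ids,          # word anchors
            'input_ids': caption_ids,    # caption tokens
            'image': image_feature}      # image features
```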

/

Ways to increase performance

  • First, 'beam search'

  • Second, 'CIDEr optimization'

  • Third, 'Ensemble'

  • Fourth, 'using random labels': the label is selected at random from the five reference captions. This not only prevents overfitting but also improves performance on evaluations such as BLEU and CIDEr (see the sketch below).

  • Here, I saw a performance improvement using only the fourth method. If the first, second, and third methods are all used as well, a further improvement of 1-2 BLEU4 points is expected.
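A minimal sketch of the fourth method, assuming five reference captions per image (the class and field names are illustrative):

```python
import random
from torch.utils.data import Dataset

class CocoCaptionDataset(Dataset):
    def __init__(self, image_features, captions_per_image):
        self.image_features = image_features          # precomputed features
        self.captions_per_image = captions_per_image  # five captions per image

    def __len__(self):
        return len(self.image_features)

    def __getitem__(self, idx):
        # 'using random labels': draw one of the five captions at random,
        # so the model sees a different reference on each epoch.
        caption = random.choice(self.captions_per_image[idx])
        return self.image_features[idx], caption
```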

/

evaluation on the karpathy test split: models/base_model.py

with beam search (beam size = 5), self-critical sequence training, and an ensemble of 5 models

| metric | score |
| --- | --- |
| BLEU1 | 0.8305 |
| BLEU2 | 0.6816 |
| BLEU3 | 0.5361 |
| BLEU4 | 0.4158 |
| CIDEr | 1.3453 |
| METEOR | 0.2892 |
| ROUGE_L | 0.5935 |
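For reference, a minimal BLEU sketch with nltk (the scores above would normally come from the official coco-caption toolkit; this is only for illustration):

```python
from nltk.translate.bleu_score import corpus_bleu

# one hypothesis per image, several tokenized references each
references = [[['a', 'dog', 'is', 'barking'], ['the', 'dog', 'barks', 'loudly']]]
hypotheses = [['a', 'dog', 'is', 'barking']]

bleu1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f'BLEU1 {bleu1:.4f}  BLEU4 {bleu4:.4f}')
```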

/

evaluation on the karpathy test split: models/base_model_with_detection.py

Originally, the goal of this project was to develop an image captioning model with high performance at low cost. As additional research, I also used image detection features to produce better results.

with beam search (beam size = 5), self-critical sequence training, and an ensemble of 3 models

| metric | score |
| --- | --- |
| BLEU1 | 0.8420 |
| BLEU2 | 0.6986 |
| BLEU3 | 0.5546 |
| BLEU4 | 0.4336 |
| CIDEr | 1.4163 |
| METEOR | 0.2968 |
| ROUGE_L | 0.6047 |

You can download the detection features from VinVL: Revisiting Visual Representations in Vision-Language Models.

3rd place at the COCO Image Caption Challenge

result

references

I got help from sgrvinod's a-PyTorch-Tutorial-to-Image-Captioning.

Thank you for reading