A new model that achieves better performance at a "low cost". I also hope it serves as an image captioning tutorial, because the steps I took are really simple. NO IMAGE DETECTION
This repo is based on many image captioning models, such as
- Show and Tell: A Neural Image Caption Generator,
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,
- Meshed-Memory Transformer for Image Captioning,
- Self-critical Sequence Training for Image Captioning, and so on.
I developed the simplest model that still achieves strong performance.
/
- The COCO data consists of 80k train images, 40k valid images, and 40k test images. Here, I did not use the test data; I trained on the 80k images and only ran validation on the 40k images.
- download images here : 'train_coco_images2014', 'valid_coco_images2014', 'test_coco_images2014'
- download caption annotations here : http://images.cocodataset.org/annotations/annotations_trainval2014.zip
- To get more training data, I followed the Karpathy split: 118,287 training images and 5,000 validation images. The Karpathy split data is available on the COCO dataset site.
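A minimal sketch of reading the Karpathy split file (commonly distributed as dataset_coco.json); the file name and split labels below are assumptions, not part of this repo:

```python
import json

# Hypothetical path; the Karpathy split is usually distributed as dataset_coco.json.
with open("dataset_coco.json") as f:
    karpathy = json.load(f)

train_images, valid_images = [], []
for img in karpathy["images"]:
    # Each entry carries a 'split' label and its reference 'sentences'.
    if img["split"] in ("train", "restval"):
        train_images.append(img)
    elif img["split"] == "val":
        valid_images.append(img)

print(len(train_images), len(valid_images))
```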
/
As the vocabulary for the embedding, I tried GPT-2 (50,257 tokens) and BERT (30,232 tokens), but these required a relatively large amount of computation and were slow to train, so I created a separate vocab_dict (see vocab.py for this).
I selected frequently used words from the COCO annotation data and encoded the captions with them (I selected 15,000 tokens).
** After a number of later experiments, the pretrained GPT-2 embedding layer performed best (check model.py).
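For reference, a rough sketch of how such a vocab_dict could be built; the real implementation is in vocab.py, and the annotation path, special tokens, and max length here are assumptions:

```python
import json
from collections import Counter

# Count word frequencies over all COCO training captions (hypothetical path).
with open("annotations/captions_train2014.json") as f:
    annotations = json.load(f)["annotations"]

counter = Counter()
for ann in annotations:
    counter.update(ann["caption"].lower().strip().split())

# Keep the 15,000 most frequent words plus a few special tokens.
special = ["<pad>", "<start>", "<end>", "<unk>"]
words = [w for w, _ in counter.most_common(15000)]
vocab_dict = {tok: idx for idx, tok in enumerate(special + words)}

def encode(caption, max_len=20):
    """Map a caption string to a fixed-length list of token ids."""
    ids = [vocab_dict.get(w, vocab_dict["<unk>"]) for w in caption.lower().split()]
    ids = [vocab_dict["<start>"]] + ids[: max_len - 2] + [vocab_dict["<end>"]]
    return ids + [vocab_dict["<pad>"]] * (max_len - len(ids))
```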
/
I used CLIP as the encoder. At the beginning of training, the encoder (ResNet) parameters were not trainable; later, re-training the already trained captioning model with the encoder parameters included (fine-tuning) showed improved performance.
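A minimal sketch of this two-stage setup, assuming OpenAI's clip package with an RN50 backbone and a placeholder decoder (both assumptions; the actual decoder is in model.py):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder, preprocess = clip.load("RN50", device=device)  # ResNet-based CLIP encoder

# Placeholder standing in for the GPT-2-based decoder defined in model.py;
# the feature dimension (1024 for RN50) is an assumption.
caption_model = torch.nn.Linear(1024, 768).to(device)

# Stage 1: freeze the CLIP encoder and train only the captioning decoder.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(caption_model.parameters(), lr=1e-4)

# Stage 2: unfreeze the encoder and fine-tune everything with a smaller learning rate.
for p in encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(caption_model.parameters()), lr=1e-5
)
```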
/
- The decoder has the simplest possible structure, but I used one trick: the image input is turned into an image token and put into the GPT-2 hidden layer. This means that 1 image token, together with 20 word tokens, i.e. (N, 21, 768), is input to GPT-2.
- Of course, there is no label for the image token, so the loss function only uses the latter 20 positions, (N, 20, 768), of the (N, 21, 768) output.
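A rough sketch of this idea using Hugging Face's GPT2LMHeadModel with inputs_embeds; the projection layer, the feature dimension, and masking the image token with a -100 label are my assumptions (and it assumes GPT-2's own token ids, per the note above about the pretrained GPT-2 embedding layer). The actual decoder is in model.py.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ImageTokenCaptioner(nn.Module):
    """One projected image token is prepended to 20 word tokens and fed to GPT-2."""

    def __init__(self, image_dim=1024):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")  # hidden size 768
        self.img_proj = nn.Linear(image_dim, 768)             # image feature -> one "token"
        self.word_emb = self.gpt2.get_input_embeddings()

    def forward(self, image_feats, input_ids):
        # image_feats: (N, image_dim), input_ids: (N, 20) caption token ids
        img_tok = self.img_proj(image_feats).unsqueeze(1)      # (N, 1, 768)
        word_tok = self.word_emb(input_ids)                    # (N, 20, 768)
        inputs = torch.cat([img_tok, word_tok], dim=1)         # (N, 21, 768)

        # The image token has no label, so it is masked out (-100) and only
        # the 20 word positions contribute to the loss.
        no_label = torch.full((input_ids.size(0), 1), -100,
                              dtype=torch.long, device=input_ids.device)
        labels = torch.cat([no_label, input_ids], dim=1)       # (N, 21)
        return self.gpt2(inputs_embeds=inputs, labels=labels).loss
```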
/
To achieve good performance, modern image captioning models use object detection features by default. However, this makes them hard to implement for users with limited GPU resources. Therefore, I made various attempts to obtain a good model with less GPU.
- tagging model
- The image captioning model takes a word anchor (tag) as well as an image as input. I want to run a separate training for the tags using various models such as CNN, LSTM, etc.
- example ) model_input : '[dog] [bark]', 'INPUT_IDS', 'IMAGE' (where '[dog] [bark]' corresponds to the tag)
- another attempt: I wanted to treat the image captioning task as text -> text rather than image -> text. I tried a training process that generates arbitrary text and uses the image to refine it into the correct caption; this is currently in progress.
/
- First, 'beam search'
- Second, 'CIDEr optimization'
- Third, 'Ensemble'
- Fourth, 'using random labels', where the label is selected at random from the five reference captions (see the sketch after this list). This not only prevents overfitting but also improves performance on evaluation metrics such as BLEU and CIDEr.
- Here, I saw a performance improvement using only the fourth method. If the first, second, and third methods are also used, an additional improvement of 1-2 points is expected on BLEU-4.
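A minimal sketch of the 'using random labels' idea as a PyTorch Dataset; all names here are hypothetical and only illustrate sampling one of the five reference captions each time an image is accessed:

```python
import random
from torch.utils.data import Dataset

class RandomCaptionDataset(Dataset):
    """Each image has five reference captions; a different one is sampled per access."""

    def __init__(self, image_features, captions_per_image, encode_fn):
        # image_features[i]: precomputed feature for image i
        # captions_per_image[i]: list of 5 reference captions for image i
        self.image_features = image_features
        self.captions_per_image = captions_per_image
        self.encode_fn = encode_fn

    def __len__(self):
        return len(self.image_features)

    def __getitem__(self, idx):
        # One of the five captions is chosen at random, so the target varies
        # across epochs and acts as light regularization.
        caption = random.choice(self.captions_per_image[idx])
        return self.image_features[idx], self.encode_fn(caption)
```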
/
With beam search (beam size = 5), self-critical sequence training, and an ensemble of 5 models:
metric | score |
---|---|
BLEU1 | 0.8305 |
BLEU2 | 0.6816 |
BLEU3 | 0.5361 |
BLEU4 | 0.4158 |
CIDEr | 1.3453 |
METEOR | 0.2892 |
ROUGE_L | 0.5935 |
/
Originally, the goal of this project was to develop an image captioning model with high performance at low cost. As additional research, I also used image detection features to produce better results.
With beam search (beam size = 5), self-critical sequence training, and an ensemble of 3 models:
metric | score |
---|---|
BLEU1 | 0.8420 |
BLEU2 | 0.6986 |
BLEU3 | 0.5546 |
BLEU4 | 0.4336 |
CIDEr | 1.4163 |
METEOR | 0.2968 |
ROUGE_L | 0.6047 |
You can download the detection features from VinVL: Revisiting Visual Representations in Vision-Language Models.
I got help from sgrvinod's a-PyTorch-Tutorial-to-Image-Captioning.