This project applies a Transformer-based model to the image captioning task. Most of the code is reimplemented from scratch, and the rest is adapted from existing work with substantial modification. The goal is to test the performance of the Transformer architecture with Bottom-Up features, so I run experiments comparing two different ways of extracting features from the visual input (an image) and encoding them as a sequence.
The following figure gives an overview of the baseline model architectures.
There are two ways to embed the visual input:
- In the patch-based architecture, image features are extracted either by splitting the image into 16x16 patches and flattening them (the same method as the Vision Transformer), or by taking a fixed 8x8 grid of tiles from an InceptionV3 model (a short sketch of the patch embedding follows the figure below).
Figure: Patch-based Encoders
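A minimal sketch of the ViT-style patch embedding described above, assuming a 224x224 input; the class name and embedding dimension are illustrative, not this repo's exact implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, flatten each patch,
    and project it to the Transformer's embedding dimension."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split + flatten + linear projection".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):          # x: (B, 3, 224, 224)
        x = self.proj(x)           # (B, embed_dim, 14, 14)
        x = x.flatten(2)           # (B, embed_dim, 196)
        return x.transpose(1, 2)   # (B, 196, embed_dim) -- a sequence of patch tokens

# Example: a batch of two images becomes a sequence of 196 tokens each.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 192])
```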
- In the architecture that uses bottom-up attention, a Faster R-CNN extracts a feature vector for each object detected in the image. This method captures object-aware semantics of the visual input and, in my opinion, produces some very good captions (see the sketch after the figure below).
Figure: Bottom-Up Encoder
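For the bottom-up path, the encoder consumes one feature vector per detected region instead of one per patch. A minimal sketch, assuming the Faster R-CNN features have already been extracted; the 36x2048 shape is the common bottom-up-attention setting and the 512-d model size is illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for precomputed Faster R-CNN region features: 36 detected regions,
# each a 2048-d vector (the usual bottom-up-attention feature size).
region_feats = torch.randn(1, 36, 2048)

# Project each region feature into the Transformer's embedding space; the result
# is a sequence of "object tokens" fed to the encoder in place of patch tokens.
proj = nn.Linear(2048, 512)          # 512 is an illustrative model dimension
object_tokens = proj(region_feats)   # (1, 36, 512)
print(object_tokens.shape)
```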
Vocabulary can be built in two ways:
- Use AutoTokenizer from the Hugging Face Transformers library (see the example after this list)
- Build it from scratch (recommended for small datasets)
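A minimal example of the Hugging Face route; the checkpoint name and sequence length are only examples, not the project's settings:

```python
from transformers import AutoTokenizer

# Any pretrained checkpoint with a suitable vocabulary works here;
# "bert-base-uncased" is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

caption = "a man riding a bicycle down a dirt road"
encoded = tokenizer(caption, padding="max_length", max_length=32,
                    truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # torch.Size([1, 32])
print(tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True))
```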
Extract features and save them as NumPy arrays
- To extract grid features from InceptionV3 (a rough sketch follows this list), use
preprocess/grid/cnn/preprocess.py
- To extract bottom-up features, I provide a Colab notebook that adapts code from the Detectron model
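The idea behind the InceptionV3 grid extraction is to keep the 8x8x2048 feature map from the last mixed block and store it as a sequence of 64 grid tokens per image. A rough sketch using torchvision; the file paths and preprocessing details are assumptions, not the repo's actual script:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import inception_v3, Inception_V3_Weights

# Load InceptionV3 and grab the 8x8x2048 feature map from the last mixed block
# with a forward hook (the classification output is never used).
model = inception_v3(weights=Inception_V3_Weights.DEFAULT).eval()
features = {}
model.Mixed_7c.register_forward_hook(
    lambda module, inp, out: features.update(grid=out))

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),     # InceptionV3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "image.jpg" / "image.npy" are placeholder paths, not the repo's layout.
img = preprocess(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(img)
grid = features["grid"]                    # (1, 2048, 8, 8)
tokens = grid.flatten(2).transpose(1, 2)   # (1, 64, 2048): 64 grid tokens
np.save("image.npy", tokens.squeeze(0).numpy())
```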
I train both the patch-based and bottom-up models on the Flickr30k dataset, which contains 31,000 images collected from Flickr, each paired with five reference sentences written by human annotators. Download COCO-format Flickr30k
For COCO captioning data format, see COCO format
The results below are recorded on the validation split after training for 100 epochs. Captions are generated with beam search using a beam width of 3 (a minimal beam-search sketch follows the table).
Model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|---|
Transformer (deit_tiny_distilled_patch16_224) | 0.61111 | 0.432 | 0.30164 | 0.21026 | 0.18603 | 0.44001 | 0.39589 | 0.1213 |
Transformer (frcnn_bottomup_attention) | 0.61693 | 0.44336 | 0.31383 | 0.22263 | 0.2128 | 0.46285 | 0.4904 | 0.15042 |
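For reference, a minimal, model-agnostic sketch of beam search decoding as used above; the `step` callable standing in for the trained decoder is an assumption, not this repo's API:

```python
import torch

def beam_search(step, bos_id, eos_id, beam_width=3, max_len=20):
    """Minimal beam search over a decoder `step(tokens) -> next-token log-probs`.

    `step` takes a (beam, length) LongTensor of token ids and returns a
    (beam, vocab) tensor of log-probabilities for the next token.
    """
    beams = [(torch.tensor([bos_id]), 0.0)]          # (token sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1].item() == eos_id:          # finished beams are kept as-is
                candidates.append((tokens, score))
                continue
            log_probs = step(tokens.unsqueeze(0))[0]            # (vocab,)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp, top_ids):
                candidates.append((torch.cat([tokens, idx.view(1)]),
                                   score + lp.item()))
        # Keep the `beam_width` highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1].item() == eos_id for t, _ in beams):
            break
    return beams[0][0]

# Toy usage with a random "decoder" over a 10-word vocabulary.
vocab = 10
dummy_step = lambda toks: torch.log_softmax(torch.randn(toks.size(0), vocab), dim=-1)
print(beam_search(dummy_step, bos_id=1, eos_id=2, beam_width=3))
```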
- To train the patch-based or bottom-up architecture (pass --bottom-up for the latter):
python train.py (--bottom-up)
- To evaluate a trained model:
python evaluate.py --weight=<checkpoint path> (--bottom-up)
Ideas from:
- CPTR: Full Transformer Network for Image Captioning (Wei Liu, Sihan Chen et al., 2021)
- Bottom-Up and Top-Down Attention for Image Captioning (Peter Anderson et al., 2018)
- https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
- https://github.com/SamLynnEvans/Transformer
- https://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://github.com/salaniz/pycocoevalcap
- https://huggingface.co/blog/how-to-generate
- https://github.com/krasserm/fairseq-image-captioning
- https://github.com/airsplay/py-bottom-up-attention