This project applies a Transformer-based model to the image captioning task. Most of the code is reimplemented from scratch, and the rest is adapted from existing work with substantial modification. The goal is to test the performance of the Transformer architecture with Bottom-Up features, so I run experiments comparing two different ways of extracting features from the visual input (an image) and encoding them as a sequence.
The following figure gives an overview of the baseline model architectures.
There are two ways to embed the visual input:
- In the patch-based architecture, image features are extracted either by splitting the image into 16x16 patches and flattening them (the same method as the Vision Transformer), or by taking a fixed 8x8 grid of tiles from an InceptionV3 model (a short sketch of the patch embedding follows the figure below).
Figure: Patch-based Encoders
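A minimal sketch of the ViT-style patch embedding described above, assuming a 224x224 input; the class name and embedding dimension are illustrative, not this repo's exact implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, flatten each patch,
    and project it to the Transformer's embedding dimension."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split + flatten + linear projection".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):          # x: (B, 3, 224, 224)
        x = self.proj(x)           # (B, embed_dim, 14, 14)
        x = x.flatten(2)           # (B, embed_dim, 196)
        return x.transpose(1, 2)   # (B, 196, embed_dim) -- a sequence of patch tokens

# Example: a batch of two images becomes a sequence of 196 tokens each.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 192])
```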
- In the architecture that uses bottom-up attention, a Faster R-CNN extracts a feature vector for each object detected in the image. This method captures object-aware semantics of the visual input and, in my opinion, produces some very good captions (see the sketch after the figure below).
Figure: Bottom-Up Encoder
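For the bottom-up path, the encoder consumes one feature vector per detected region instead of one per patch. A minimal sketch, assuming the Faster R-CNN features have already been extracted; the 36x2048 shape is the common bottom-up-attention setting and the 512-d model size is illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for precomputed Faster R-CNN region features: 36 detected regions,
# each a 2048-d vector (the usual bottom-up-attention feature size).
region_feats = torch.randn(1, 36, 2048)

# Project each region feature into the Transformer's embedding space; the result
# is a sequence of "object tokens" fed to the encoder in place of patch tokens.
proj = nn.Linear(2048, 512)          # 512 is an illustrative model dimension
object_tokens = proj(region_feats)   # (1, 36, 512)
print(object_tokens.shape)
```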
Vocabulary can be built in two ways:
- Use AutoTokenizer from the Hugging Face Transformers library (see the example after this list)
- Build it from scratch (recommended for small datasets)
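A minimal example of the Hugging Face route; the checkpoint name and sequence length are only examples, not the project's settings:

```python
from transformers import AutoTokenizer

# Any pretrained checkpoint with a suitable vocabulary works here;
# "bert-base-uncased" is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

caption = "a man riding a bicycle down a dirt road"
encoded = tokenizer(caption, padding="max_length", max_length=32,
                    truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # torch.Size([1, 32])
print(tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True))
```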
Extract features and save them as NumPy arrays
- To extract grid features from InceptionV3 (a rough sketch follows this list), use
preprocess/grid/cnn/preprocess.py
- To extract bottom-up features, I provide a Colab notebook that adapts code from the Detectron model
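The idea behind the InceptionV3 grid extraction is to keep the 8x8x2048 feature map from the last mixed block and store it as a sequence of 64 grid tokens per image. A rough sketch using torchvision; the file paths and preprocessing details are assumptions, not the repo's actual script:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import inception_v3, Inception_V3_Weights

# Load InceptionV3 and grab the 8x8x2048 feature map from the last mixed block
# with a forward hook (the classification output is never used).
model = inception_v3(weights=Inception_V3_Weights.DEFAULT).eval()
features = {}
model.Mixed_7c.register_forward_hook(
    lambda module, inp, out: features.update(grid=out))

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),     # InceptionV3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "image.jpg" / "image.npy" are placeholder paths, not the repo's layout.
img = preprocess(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(img)
grid = features["grid"]                    # (1, 2048, 8, 8)
tokens = grid.flatten(2).transpose(1, 2)   # (1, 64, 2048): 64 grid tokens
np.save("image.npy", tokens.squeeze(0).numpy())
```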
I train both the patch-based and bottom-up models on the Flickr30k dataset, which contains 31,000 images collected from Flickr, each paired with five reference sentences written by human annotators. Download COCO-format Flickr30k
For COCO captioning data format, see COCO format
The results below are recorded on the validation split after training for 100 epochs. Captions are generated with beam search using a beam width of 3 (a minimal beam-search sketch follows the table).
Model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|---|
Transformer (deit_tiny_distilled_patch16_224) | 0.61111 | 0.432 | 0.30164 | 0.21026 | 0.18603 | 0.44001 | 0.39589 | 0.1213 |
Transformer (frcnn_bottomup_attention) | 0.61693 | 0.44336 | 0.31383 | 0.22263 | 0.2128 | 0.46285 | 0.4904 | 0.15042 |
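For reference, a minimal, model-agnostic sketch of beam search decoding as used above; the `step` callable standing in for the trained decoder is an assumption, not this repo's API:

```python
import torch

def beam_search(step, bos_id, eos_id, beam_width=3, max_len=20):
    """Minimal beam search over a decoder `step(tokens) -> next-token log-probs`.

    `step` takes a (beam, length) LongTensor of token ids and returns a
    (beam, vocab) tensor of log-probabilities for the next token.
    """
    beams = [(torch.tensor([bos_id]), 0.0)]          # (token sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1].item() == eos_id:          # finished beams are kept as-is
                candidates.append((tokens, score))
                continue
            log_probs = step(tokens.unsqueeze(0))[0]            # (vocab,)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp, top_ids):
                candidates.append((torch.cat([tokens, idx.view(1)]),
                                   score + lp.item()))
        # Keep the `beam_width` highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1].item() == eos_id for t, _ in beams):
            break
    return beams[0][0]

# Toy usage with a random "decoder" over a 10-word vocabulary.
vocab = 10
dummy_step = lambda toks: torch.log_softmax(torch.randn(toks.size(0), vocab), dim=-1)
print(beam_search(dummy_step, bos_id=1, eos_id=2, beam_width=3))
```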
- To train the patch-based or bottom-up architecture (pass --bottom-up for the latter):
python train.py (--bottom-up)
- To evaluate a trained model:
python evaluate.py --weight=<checkpoint path> (--bottom-up)
Ideas from:
- CPTR: Full Transformer Network for Image Captioning (Wei Liu, Sihan Chen et al., 2021)
- Bottom-Up and Top-Down Attention for Image Captioning (Peter Anderson et al., 2018)
- https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
- https://github.com/SamLynnEvans/Transformer
- https://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://github.com/salaniz/pycocoevalcap
- https://huggingface.co/blog/how-to-generate
- https://github.com/krasserm/fairseq-image-captioning
- https://github.com/airsplay/py-bottom-up-attention