- This project is an implementation of an image captioning model based on the Transformer, written in PyTorch.
- ResNet152 is used for extracting image features. You can check the pre-trained models here.
- Uses the COCO dataset (2017 Train/Val/Test images and annotations).
- Please check `config.py` for the configuration; a hypothetical sketch of the kind of settings it holds follows below. You can also train on multiple GPUs.
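As a rough, hypothetical sketch of the kind of settings `config.py` holds (the field names and values below are illustrative, not the actual ones; only the 100-epoch stopping point is stated in this README):

```python
# Hypothetical illustration of the kind of settings config.py exposes;
# the real option names and values are defined in the repository.
class Config:
    data_dir = "data"                               # COCO images/annotations (see Prerequisites)
    feature_path = "data/output_feature.pickle"     # output of feature extraction
    vocab_path = "vocab/vocab.pickle"               # output of vocabulary building
    num_epochs = 100                                # training stops at epoch 100
    batch_size = 64                                 # illustrative value
    lr = 1e-4                                       # illustrative value
    num_gpus = 1                                    # multi-GPU training is supported
    result_dir = "result"                           # checkpoints and train-log.txt go here
```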
## Prerequisites
- Clone this repo:

```bash
git clone https://github.com/minjoong507/Image-Captioning-Transformer.git
cd Image-Captioning-Transformer
```
- Download the COCO dataset.
- After downloading the image/annotation data, put the image/annotation files in the `data` dir:

```bash
mkdir data
```

```
data
├── annotations
├── ls
├── train2017
├── val2017
└── test2017
```
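As a quick sanity check that everything landed in the right place, the annotations can be loaded with `pycocotools`; the annotation filename below is the standard COCO 2017 name and is an assumption about this setup:

```python
from pycocotools.coco import COCO

# Standard COCO 2017 caption annotation file; the path assumes the layout above.
coco = COCO("data/annotations/captions_val2017.json")

img_id = coco.getImgIds()[0]             # first image id in the split
ann_ids = coco.getAnnIds(imgIds=img_id)  # caption annotation ids for that image
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])                # the ground-truth captions
```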
- Install packages:
  - Python 3.8.5
  - PyTorch 1.7.0+cu110
  - nltk
  - tqdm
  - pycocotools
## Training
- Add the project root to `PYTHONPATH`:

```bash
source setup.sh
```

- Build the vocabulary (a sketch of this step follows the list):

```bash
python vocab/make_vocab.py
```

- Extract image features (see the sketch at the end of this section):

```bash
python feature_extraction/extraction.py
```

- Train the model:

```bash
python train.py
```
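As promised above, here is a minimal sketch of the vocabulary-building step, assuming nltk word tokenization over the COCO training captions with a frequency cutoff; the cutoff and the special-token names are assumptions, though `[UNK]` and `[EOS]` do appear in the sample outputs below:

```python
import pickle
from collections import Counter

import nltk  # may require: nltk.download("punkt")
from pycocotools.coco import COCO

# Count word frequencies over all training captions.
coco = COCO("data/annotations/captions_train2017.json")
counter = Counter()
for ann in coco.loadAnns(coco.getAnnIds()):
    counter.update(nltk.tokenize.word_tokenize(ann["caption"].lower()))

# Special tokens first; the [UNK]/[EOS] names match the sample outputs below.
words = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]
words += [w for w, c in counter.items() if c >= 5]  # assumed frequency cutoff
word2idx = {w: i for i, w in enumerate(words)}

with open("vocab/vocab.pickle", "wb") as f:
    pickle.dump(word2idx, f)
```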
Training with the above config stops at epoch 100. I used a single RTX 3090 GPU or six of them. The `result` dir contains the output of each run; a `2021-*` subdirectory (named by the start time) contains the saved model and `train-log.txt`.
Example:

```
result
└── 2021-04-16-12-00
    ├── model.ckpt
    └── train-log.txt
```
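For reference, the feature-extraction step above boils down to running each image through a pre-trained ResNet152 with the classification head removed. A minimal sketch of that idea (the preprocessing, the file names, and the output layout are assumptions; `feature_extraction/extraction.py` is the actual implementation):

```python
import pickle

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet-152 with the final fc layer dropped, so the output is
# the 2048-d pooled feature rather than class logits.
resnet = models.resnet152(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = {}
# Hypothetical file; COCO names images by zero-padded image id.
img = Image.open("data/val2017/000000004983.jpg").convert("RGB")
with torch.no_grad():
    feat = extractor(transform(img).unsqueeze(0))  # shape (1, 2048, 1, 1)
features[4983] = feat.flatten().numpy()

with open("data/output_feature.pickle", "wb") as f:
    pickle.dump(features, f)
```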
## Testing

```bash
python Inference.py --test_path MODEL_DIR_NAME
```

`MODEL_DIR_NAME` is the name of the dir containing the saved model, e.g., `result/2021-*`.
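For reference, caption generation at test time is autoregressive. A minimal sketch of greedy decoding, assuming a `model(feature, tokens)` call that returns per-position vocabulary logits; the interface and the special-token names are assumptions, not the repository's actual API:

```python
import torch

def greedy_decode(model, feature, word2idx, idx2word, max_len=30):
    """Generate a caption token by token, always taking the argmax."""
    tokens = [word2idx["[BOS]"]]
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)  # (1, t)
        logits = model(feature, inp)             # assumed shape: (1, t, vocab)
        next_id = logits[0, -1].argmax().item()  # most likely next token
        tokens.append(next_id)
        if idx2word[next_id] == "[EOS]":
            break
    return " ".join(idx2word[i] for i in tokens[1:])
```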
- Train loss & acc (100 epochs)
- Single GPU:
  - Accuracy: 98.6766 %
  - Result:

```
predict : a [UNK] with people is near a pier on clear water . [EOS]
target  : a [UNK] with people is near a pier on clear water . [EOS]
```
- Six GPUs:
  - Accuracy: 99.4794 %
  - Result:

```
predict : a picture of a giraffe standing in a zoo exhibit . [EOS]
target  : a picture of a giraffe standing in a zoo exhibit . [EOS]
predict : people and buses on a city street under cloudy skies . [EOS]
target  : people and buses on a city street under cloudy skies . [EOS]
predict : a man at an office desk drinking a glass of wine . [EOS]
target  : a man at an office desk drinking a glass of wine . [EOS]
predict : two zebras are standing next to a log . [EOS]
target  : two zebras are standing next to a log . [EOS]
```
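The accuracy numbers above are presumably teacher-forced token-level accuracy over the caption tokens; a minimal sketch of how such a metric can be computed, ignoring padding (this is an assumption about the metric, not how `train.py` necessarily defines it):

```python
import torch

def token_accuracy(logits, targets, pad_id=0):
    """Fraction of non-pad target tokens whose argmax prediction matches.

    logits: (batch, seq_len, vocab); targets: (batch, seq_len).
    pad_id=0 is an assumed padding index.
    """
    preds = logits.argmax(dim=-1)
    mask = targets != pad_id
    return (preds[mask] == targets[mask]).float().mean().item()
```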
## Inference
- An example of the output file is `pred.jsonl`, with each line formatted as:

```
{ "image_id": 00004983, "caption": "" }
```

`pred.jsonl` will be saved in the `eval` dir.
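Writing that file is one `json.dumps` call per line; a minimal sketch, reusing the hypothetical `greedy_decode` helper from the Testing section (the `features`, `model`, and vocab objects are the assumed ones from the sketches above):

```python
import json

# Each line of pred.jsonl is a standalone JSON object.
with open("eval/pred.jsonl", "w") as f:
    for image_id, feature in features.items():
        caption = greedy_decode(model, feature, word2idx, idx2word)  # hypothetical helper
        f.write(json.dumps({"image_id": image_id, "caption": caption}) + "\n")
```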
- Description of the model and other details
- Code refactoring
- Add `Inference.py`
Repository structure:

```
Image-Captioning-Transformer
├── model
│   ├── data_loader.py
│   ├── layers.py
│   ├── model.py
│   └── optimization.py
├── data
│   ├── output_feature.pickle   # created by feature_extraction/extraction.py
│   ├── annotations
│   ├── ls
│   ├── train2017
│   ├── val2017
│   └── test2017
├── feature_extraction
│   ├── data_loader.py
│   ├── extraction.py
│   └── resnet.py
├── vocab
│   ├── vocab.pickle            # created by vocab/make_vocab.py
│   ├── coco_idx.npy            # created by feature_extraction/extraction.py
│   └── make_vocab.py
├── config.py
├── train.py
├── Inference.py
├── setup.sh
├── LICENSE
├── .gitignore
└── README.md
```
## Reference

- [1] TVCaption
- [2] huggingface/transformers