- This project is an implementation of an image captioning model based on the Transformer, written in PyTorch.
- ResNet152 is used for extracting image features. You can check the pre-trained models here.
- Uses the COCO dataset (2017 Train/Val/Test images and annotations).
- Please check `config.py` for the configuration; a hypothetical sketch of the kind of settings it holds follows below. You can also train on multiple GPUs.
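As a rough, hypothetical sketch of the kind of settings `config.py` holds (the field names and values below are illustrative, not the actual ones; only the 100-epoch stopping point is stated in this README):

```python
# Hypothetical illustration of the kind of settings config.py exposes;
# the real option names and values are defined in the repository.
class Config:
    data_dir = "data"                               # COCO images/annotations (see Prerequisites)
    feature_path = "data/output_feature.pickle"     # output of feature extraction
    vocab_path = "vocab/vocab.pickle"               # output of vocabulary building
    num_epochs = 100                                # training stops at epoch 100
    batch_size = 64                                 # illustrative value
    lr = 1e-4                                       # illustrative value
    num_gpus = 1                                    # multi-GPU training is supported
    result_dir = "result"                           # checkpoints and train-log.txt go here
```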
## Prerequisites
- Clone this repo:

```bash
git clone https://github.com/minjoong507/Image-Captioning-Transformer.git
cd Image-Captioning-Transformer
```
- Download the COCO dataset.
- After downloading the image/annotation data, put the image/annotation files in the `data` dir:

```bash
mkdir data
```

```
data
├── annotations
├── ls
├── train2017
├── val2017
└── test2017
```
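As a quick sanity check that everything landed in the right place, the annotations can be loaded with `pycocotools`; the annotation filename below is the standard COCO 2017 name and is an assumption about this setup:

```python
from pycocotools.coco import COCO

# Standard COCO 2017 caption annotation file; the path assumes the layout above.
coco = COCO("data/annotations/captions_val2017.json")

img_id = coco.getImgIds()[0]             # first image id in the split
ann_ids = coco.getAnnIds(imgIds=img_id)  # caption annotation ids for that image
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])                # the ground-truth captions
```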
- Install packages:
  - Python 3.8.5
  - PyTorch 1.7.0+cu110
  - nltk
  - tqdm
  - pycocotools
## Training
- Add the project root to `PYTHONPATH`:

```bash
source setup.sh
```

- Build the vocabulary (a sketch of this step follows the list):

```bash
python vocab/make_vocab.py
```

- Extract image features (see the sketch at the end of this section):

```bash
python feature_extraction/extraction.py
```

- Train the model:

```bash
python train.py
```
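As promised above, here is a minimal sketch of the vocabulary-building step, assuming nltk word tokenization over the COCO training captions with a frequency cutoff; the cutoff and the special-token names are assumptions, though `[UNK]` and `[EOS]` do appear in the sample outputs below:

```python
import pickle
from collections import Counter

import nltk  # may require: nltk.download("punkt")
from pycocotools.coco import COCO

# Count word frequencies over all training captions.
coco = COCO("data/annotations/captions_train2017.json")
counter = Counter()
for ann in coco.loadAnns(coco.getAnnIds()):
    counter.update(nltk.tokenize.word_tokenize(ann["caption"].lower()))

# Special tokens first; the [UNK]/[EOS] names match the sample outputs below.
words = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]
words += [w for w, c in counter.items() if c >= 5]  # assumed frequency cutoff
word2idx = {w: i for i, w in enumerate(words)}

with open("vocab/vocab.pickle", "wb") as f:
    pickle.dump(word2idx, f)
```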
Training with the above config stops at epoch 100. I used a single RTX 3090 GPU or six of them. The `result` dir contains the output of each run; a `2021-*` subdirectory (named by the start time) contains the saved model and `train-log.txt`.
Example:

```
result
└── 2021-04-16-12-00
    ├── model.ckpt
    └── train-log.txt
```
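For reference, the feature-extraction step above boils down to running each image through a pre-trained ResNet152 with the classification head removed. A minimal sketch of that idea (the preprocessing, the file names, and the output layout are assumptions; `feature_extraction/extraction.py` is the actual implementation):

```python
import pickle

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet-152 with the final fc layer dropped, so the output is
# the 2048-d pooled feature rather than class logits.
resnet = models.resnet152(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = {}
# Hypothetical file; COCO names images by zero-padded image id.
img = Image.open("data/val2017/000000004983.jpg").convert("RGB")
with torch.no_grad():
    feat = extractor(transform(img).unsqueeze(0))  # shape (1, 2048, 1, 1)
features[4983] = feat.flatten().numpy()

with open("data/output_feature.pickle", "wb") as f:
    pickle.dump(features, f)
```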
## Testing

```bash
python Inference.py --test_path MODEL_DIR_NAME
```

`MODEL_DIR_NAME` is the name of the dir containing the saved model, e.g., `result/2021-*`.
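For reference, caption generation at test time is autoregressive. A minimal sketch of greedy decoding, assuming a `model(feature, tokens)` call that returns per-position vocabulary logits; the interface and the special-token names are assumptions, not the repository's actual API:

```python
import torch

def greedy_decode(model, feature, word2idx, idx2word, max_len=30):
    """Generate a caption token by token, always taking the argmax."""
    tokens = [word2idx["[BOS]"]]
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)  # (1, t)
        logits = model(feature, inp)             # assumed shape: (1, t, vocab)
        next_id = logits[0, -1].argmax().item()  # most likely next token
        tokens.append(next_id)
        if idx2word[next_id] == "[EOS]":
            break
    return " ".join(idx2word[i] for i in tokens[1:])
```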
- Train loss & acc (100 epochs)
- Single GPU:
  - Accuracy: 98.6766 %
  - Result:

```
predict : a [UNK] with people is near a pier on clear water . [EOS]
target  : a [UNK] with people is near a pier on clear water . [EOS]
```
- Six GPUs:
  - Accuracy: 99.4794 %
  - Result:

```
predict : a picture of a giraffe standing in a zoo exhibit . [EOS]
target  : a picture of a giraffe standing in a zoo exhibit . [EOS]
predict : people and buses on a city street under cloudy skies . [EOS]
target  : people and buses on a city street under cloudy skies . [EOS]
predict : a man at an office desk drinking a glass of wine . [EOS]
target  : a man at an office desk drinking a glass of wine . [EOS]
predict : two zebras are standing next to a log . [EOS]
target  : two zebras are standing next to a log . [EOS]
```
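The accuracy numbers above are presumably teacher-forced token-level accuracy over the caption tokens; a minimal sketch of how such a metric can be computed, ignoring padding (this is an assumption about the metric, not how `train.py` necessarily defines it):

```python
import torch

def token_accuracy(logits, targets, pad_id=0):
    """Fraction of non-pad target tokens whose argmax prediction matches.

    logits: (batch, seq_len, vocab); targets: (batch, seq_len).
    pad_id=0 is an assumed padding index.
    """
    preds = logits.argmax(dim=-1)
    mask = targets != pad_id
    return (preds[mask] == targets[mask]).float().mean().item()
```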
## Inference
- An example of the output file is `pred.jsonl`, with each line formatted as:

```
{ "image_id": 00004983, "caption": "" }
```

`pred.jsonl` will be saved in the `eval` dir.
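Writing that file is one `json.dumps` call per line; a minimal sketch, reusing the hypothetical `greedy_decode` helper from the Testing section (the `features`, `model`, and vocab objects are the assumed ones from the sketches above):

```python
import json

# Each line of pred.jsonl is a standalone JSON object.
with open("eval/pred.jsonl", "w") as f:
    for image_id, feature in features.items():
        caption = greedy_decode(model, feature, word2idx, idx2word)  # hypothetical helper
        f.write(json.dumps({"image_id": image_id, "caption": caption}) + "\n")
```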
- Description of the model and other details
- Code refactoring
- Add `Inference.py`
Repository structure:

```
Image-Captioning-Transformer
├── model
│   ├── data_loader.py
│   ├── layers.py
│   ├── model.py
│   └── optimization.py
├── data
│   ├── output_feature.pickle   # created by feature_extraction/extraction.py
│   ├── annotations
│   ├── ls
│   ├── train2017
│   ├── val2017
│   └── test2017
├── feature_extraction
│   ├── data_loader.py
│   ├── extraction.py
│   └── resnet.py
├── vocab
│   ├── vocab.pickle            # created by vocab/make_vocab.py
│   ├── coco_idx.npy            # created by feature_extraction/extraction.py
│   └── make_vocab.py
├── config.py
├── train.py
├── Inference.py
├── setup.sh
├── LICENSE
├── .gitignore
└── README.md
```
## Reference

- [1] TVCaption
- [2] huggingface/transformers