
Image captioning with Visual Attention

This project follows this TensorFlow 2 tutorial. Instead of training on a small subset as the tutorial does, I train on the whole MS-COCO-2014 dataset.


Environment tool: conda

In environment.yml:

  • Change the path /home/anhvd/miniconda3/envs/imgcap to match your OS and username.
  • Check which cudatoolkit and cudnn versions your GPU supports. You might need to change these pins:
    • cudatoolkit=10.1.243=h6bb024c_0
    • cudnn=7.6.5=cuda10.1_0


conda env create -f environment.yml

Alternatively, create a new environment with this command:

conda create -n imgcap python=3.6 tensorflow-gpu cudatoolkit=<version> cudnn=<version>
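
Before starting a long training run, it can help to confirm that TensorFlow actually sees the GPU in the new environment. The following is a minimal sketch and not part of this repository: `describe_gpus` is a hypothetical helper name, and it assumes a TensorFlow 2.x install where `tf.config.list_physical_devices` (or its older `experimental` variant) is available.

```python
def describe_gpus():
    """Return a one-line summary of the GPUs TensorFlow can see."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed in this environment"

    # TF 2.1+ exposes list_physical_devices directly; TF 2.0 only has
    # the experimental variant, so fall back to it if needed.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices

    gpus = list_devices("GPU")
    if not gpus:
        return f"TensorFlow {tf.__version__}: no GPU visible (check cudatoolkit/cudnn)"
    names = ", ".join(gpu.name for gpu in gpus)
    return f"TensorFlow {tf.__version__}: {len(gpus)} GPU(s) visible: {names}"


if __name__ == "__main__":
    print(describe_gpus())
```

If no GPU is reported, the cudatoolkit and cudnn versions are the first thing to check.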


The model was trained on a single Tesla K80 (12 GiB) for about 10 hours.

Step 1: run python download_extract.py. It downloads and prepares the MS-COCO-2014 dataset, then tokenizes the captions and extracts image features.
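
To give a rough idea of what caption preparation involves, here is a simplified, pure-Python sketch. It is not the code in download_extract.py: the function names `group_captions`, `build_vocab` and `encode` are hypothetical, the `top_k` default is an arbitrary illustration, and punctuation handling is left out. It only assumes the COCO captions layout, where the JSON file has an `annotations` list whose entries carry `image_id` and `caption`.

```python
from collections import Counter, defaultdict


def group_captions(annotations):
    """Group captions by image id and wrap each in <start>/<end> markers."""
    grouped = defaultdict(list)
    for ann in annotations["annotations"]:
        text = ann["caption"].strip().lower()
        grouped[ann["image_id"]].append(f"<start> {text} <end>")
    return grouped


def build_vocab(captions, top_k=5000):
    """Map the top_k most frequent words to ids; 0 is padding, 1 is unknown."""
    counts = Counter(word for cap in captions for word in cap.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(top_k):
        vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab, max_length):
    """Turn a caption into a list of ids, right-padded to max_length."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in caption.split()]
    return ids + [vocab["<pad>"]] * (max_length - len(ids))


# Tiny in-memory stand-in for captions_train2014.json (made-up captions).
sample = {
    "annotations": [
        {"image_id": 1, "caption": "A man riding a wave"},
        {"image_id": 1, "caption": "A surfer on a board"},
        {"image_id": 2, "caption": "Two dogs"},
    ]
}

grouped = group_captions(sample)
all_captions = [cap for caps in grouped.values() for cap in caps]
vocab = build_vocab(all_captions)
max_length = max(len(cap.split()) for cap in all_captions)
encoded = [encode(cap, vocab, max_length) for cap in all_captions]
```

Every encoded caption ends up with the same length, which is what lets them be batched during training.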

Step 2: run python train.py. It trains the model and saves it.


Download pretrained_models.zip from the latest release of this repository. The zip file provides:

  • annotations/captions_train2014.json from MS-COCO-2014 dataset
  • my checkpoints folder (the pretrained models)
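
The archive can be unpacked with any zip tool. As one option, here is a minimal sketch using Python's standard `zipfile` module; `extract_archive` is a hypothetical helper name, and the example assumes pretrained_models.zip was downloaded into the current directory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract a zip archive and return its sorted top-level entry names."""
    target = Path(target_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return top_level


if __name__ == "__main__":
    # Assumption: the archive sits next to this script.
    if Path("pretrained_models.zip").exists():
        print(extract_archive("pretrained_models.zip"))
```

The returned names are a quick way to see where the annotations file and checkpoints folder ended up.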

Open inf.py and scroll down to the block of code below. Edit image_file and, if needed, annotation_file and checkpoint_path, then run python inf.py.

if '__main__' == __name__:
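    image_file = 'surf.jpg'
    # annotation_file and checkpoint_path must also be defined here; point them
    # at the annotations file and checkpoints folder from pretrained_models.zip.
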
    ts = time.time()
    feature_extractor = FeatureExtraction.build_model_InceptionV3()
    tokenizer, max_length = load_tokenizer(annotation_file)
    encoder, decoder = load_latest_imgcap(checkpoint_path)
    te = time.time()

    load_model_time = te - ts

    models = [feature_extractor, 
                tokenizer, max_length, 
                encoder, decoder]

    ts = time.time()
    print(inference(image_file, models))
    te = time.time()

    inference_time = te - ts
    print(f'Loading models takes {load_model_time} seconds')
    print(f'Inference takes {inference_time} seconds')
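
The timing pattern above repeats the same start/stop bookkeeping twice. As an optional variation, not taken from inf.py, the sketch below wraps it in a small context manager and uses `time.perf_counter`, which is better suited to measuring intervals than `time.time`. The name `timed` is hypothetical, and the `time.sleep` calls are stand-ins for the real model loading and inference calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the elapsed seconds of the wrapped block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("Loading models", timings):
    time.sleep(0.01)  # stand-in for building the extractor and loading checkpoints
with timed("Inference", timings):
    time.sleep(0.01)  # stand-in for inference(image_file, models)

for label, seconds in timings.items():
    print(f"{label} takes {seconds:.3f} seconds")
```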

