
ExpansionNet v2: Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning

Implementation code for "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning" [ BigData2023 ] [ Arxiv ], previously titled "ExpansionNet v2: Block Static Expansion in fast end to end training for Image Captioning".


You can test the model on generic images (not included in COCO) by downloading the checkpoint here and running the script demo.py:

python demo.py \
     	--load_path your_download_folder/rf_model.pth \
     	--image_paths your_image_path/image_1 your_image_path/image_2 ...
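
When there are many images, typing every path by hand gets tedious. The following is a minimal sketch that builds the same demo.py invocation from a folder of images. The helper name build_demo_command and the set of accepted extensions are assumptions made for illustration; only the --load_path and --image_paths flags come from the command above.

```python
import sys
from pathlib import Path

# File extensions treated as images; this set is an assumption of the sketch.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def build_demo_command(checkpoint, image_dir):
    """Return the argument list that runs demo.py on every image in image_dir."""
    images = sorted(
        str(path)
        for path in Path(image_dir).iterdir()
        if path.suffix.lower() in IMAGE_EXTENSIONS
    )
    if not images:
        raise ValueError(f"no images found in {image_dir}")
    return [
        sys.executable, "demo.py",
        "--load_path", str(checkpoint),
        "--image_paths", *images,
    ]


# Usage (run from the repository root):
#   import subprocess
#   cmd = build_demo_command("your_download_folder/rf_model.pth", "your_image_path")
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when image names contain spaces.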

Some example images are available in demo_material.


SacreEOS Signature: STANDARDwInit+Cider-D[n4,s6.0]+average[nspi5]+1.0.0.
Results are artifact-free.

Online evaluation server results:

Captions  B1    B2    B3    B4    Meteor  Rouge-L  CIDEr-D
c40       96.9  92.6  85.0  75.3  40.1    76.4     140.8
c5        83.3  68.8  54.4  42.1  30.4    60.8     138.5

Results on the Karpathy test split:

Model     B@1   B@4   Meteor  Rouge-L  CIDEr-D  Spice
Ensemble  83.5  42.7  30.6    61.1     143.7    24.7
Single    82.8  41.5  30.3    60.5     140.4    24.5

ONNX & TensorRT

The model now supports ONNX conversion and deployment with TensorRT. The graph can be generated using onnx4tensorrt/convert2onnx.py. Running it mainly requires the onnx package; the onnx_runtime and onnx_tensorrt packages are optional and used only for testing (see the convert2onnx.py arguments).

Generic conversion commands:

python onnx4tensorrt/convert2onnx.py --onnx_simplify true --load_model_path <your_path> &> output_onnx.txt &
python onnx4tensorrt/onnx2tensorrt.py &> output_tensorrt.txt &

Currently this works only in FP32.


In this guide we cover all the training steps reported in the paper and provide the commands to reproduce our work.


  • python >= 3.7
  • numpy
  • Java 1.8.0
  • torch
  • torchvision
  • h5py

Installing whichever versions of torch, torchvision, h5py, and Pillow fit your machine should work in most cases.

One working set of requirements can be found in requirements.txt; if TensorRT is also needed, use requirements_wTensorRT.txt. These files describe a single working configuration, so the specific versions of each package might not be required.
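
Before starting a long run it can help to confirm that the packages are importable at all. The snippet below is a small sketch using only the standard library; the helper names are hypothetical and not part of this repository. It assumes the usual import names (PIL for Pillow) and does not check Java or package versions.

```python
import importlib.util
import sys

# Import names of the packages listed above; "PIL" is the import name of Pillow.
REQUIRED_MODULES = ["numpy", "torch", "torchvision", "h5py", "PIL"]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 7)):
    """Check the interpreter against the minimum Python version."""
    return sys.version_info >= minimum


# Usage:
#   print("python ok:", python_is_supported())
#   print("missing:", missing_modules(REQUIRED_MODULES))
```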

Data preparation

MS-COCO 2014 images can be downloaded here, the respective captions are uploaded to our online drive, and the backbone can be found here. We suggest moving all files, in particular dataset_coco.json and the backbone, into github_ignore_material/raw_data/, since the commands in the following steps assume they are placed in that directory.


For the sake of transparency (at the cost of being somewhat verbose), the complete commands are shown below. Only a few arguments need care to reproduce our work; most are handled automatically.

Logs are stored in output_file.txt, which is continuously updated until the process completes (on Linux, the command watch -n 1 tail -n 30 output_file.txt may be handy). The file is overwritten in each training phase, so save or copy it before moving to the next phase if you need it.
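
On systems without watch and tail, the same inspection and the per-phase backup can be sketched in Python. Both helper names and the backup naming scheme (output_file.<phase>.txt) are assumptions chosen for illustration, not conventions of this repository.

```python
import shutil
from collections import deque
from pathlib import Path


def tail(path, n=30):
    """Return the last n lines of a text file, reading it as a stream."""
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=n))


def backup_log(path, phase_name):
    """Copy the log next to itself, e.g. output_file.phase2.txt."""
    source = Path(path)
    target = source.with_name(f"{source.stem}.{phase_name}{source.suffix}")
    shutil.copy2(source, target)
    return str(target)


# Usage:
#   print("".join(tail("output_file.txt")))
#   backup_log("output_file.txt", "phase2")
```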

Lastly, in some configurations the batch size may look different from the one reported in the paper when the argument num_accum is specified (default is 1). The difference is only apparent: gradient accumulation is performed to fit within the 40GB memory of a single GPU.
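
To make the relationship concrete, here is a short worked calculation. It assumes the effective batch size is the product of batch_size, num_accum, and num_gpus, which is the usual meaning of gradient accumulation; the function name is hypothetical, and the values are copied from the training commands in this guide.

```python
def effective_batch_size(batch_size, num_accum=1, num_gpus=1):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * num_accum * num_gpus


# Values taken from the training commands in this guide.
PHASES = {
    "step 2 (XE, partial)": dict(batch_size=48, num_accum=1, num_gpus=1),
    "step 3 (XE, end to end)": dict(batch_size=16, num_accum=3, num_gpus=1),
    "step 5 (CIDEr, partial)": dict(batch_size=24, num_accum=2, num_gpus=1),
    "step 6 (CIDEr, end to end)": dict(batch_size=24, num_accum=2, num_gpus=1),
}

for name, config in PHASES.items():
    print(f"{name}: {effective_batch_size(**config)}")
```

Under this assumption every phase comes out at 48 samples per optimizer step, even though the per-iteration batch_size differs.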

1. Cross Entropy Training: Features generation

First we generate the features for the first training step:

cd ExpansionNet_v2_src
python data_generator.py \
    --save_model_path ./github_ignore_material/raw_data/swin_large_patch4_window12_384_22k.pth \
    --output_path ./github_ignore_material/raw_data/features.hdf5 \
    --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ \
    --captions_path ./github_ignore_material/raw_data/ &> output_file.txt &

Although we suggest not doing so, the output_path argument can be replaced with a different destination (this also requires changing the features_path argument in the later commands). Since the file is quite large (102GB), it is automatically overwritten by the later operations once the first training is completed, provided the default name is unchanged.

TIPS: if 100GB is too much for your disk, add the option --dtype fp16, which saves the arrays in FP16 and requires only about 50GB. It should not affect the results much. By default, we keep FP32 for conformity with the experimental setup of the paper.
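
The halving follows from the element width: FP32 uses 4 bytes per value and FP16 uses 2. The sketch below shows that arithmetic and the rounding introduced by a half-precision round trip, using only the standard library. The function names are hypothetical, and the size estimate ignores HDF5 metadata and any non-float data in the file, so real sizes will differ somewhat.

```python
import struct

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def estimated_size_gb(fp32_size_gb, dtype):
    """Scale an FP32 file size by the byte width of the target dtype."""
    return fp32_size_gb * BYTES_PER_ELEMENT[dtype] / BYTES_PER_ELEMENT["fp32"]


def fp16_round_trip(value):
    """Store a float as IEEE half precision and read it back."""
    return struct.unpack("e", struct.pack("e", value))[0]


print(estimated_size_gb(102, "fp16"))        # rough size of the FP16 file in GB
print(abs(fp16_round_trip(0.1) - 0.1))       # rounding error for one sample value
```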

2. Cross-Entropy Training: Partial Training

In this step the model is trained using the Cross Entropy loss and the features generated in the previous step:

python train.py --N_enc 3 --N_dec 3  \
    --model_dim 512 --seed 775533 --optim_type radam --sched_type custom_warmup_anneal  \
    --warmup 10000 --lr 2e-4 --anneal_coeff 0.8 --anneal_every_epoch 2 --enc_drop 0.3 \
    --dec_drop 0.3 --enc_input_drop 0.3 --dec_input_drop 0.3 --drop_other 0.3  \
    --batch_size 48 --num_accum 1 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [3]  \
    --save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1  \
    --is_end_to_end False --features_path ./github_ignore_material/raw_data/features.hdf5 --partial_load False \
    --print_every_iter 11807 --eval_every_iter 999999 \
    --reinforce False --num_epochs 8 &> output_file.txt &
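
As a rough illustration of how --warmup, --lr, --anneal_coeff, and --anneal_every_epoch might interact, the sketch below implements one plausible reading of the flags: a linear warmup over the first iterations, followed by multiplying the rate by anneal_coeff every anneal_every_epoch epochs. This is an assumption, not the repository's scheduler; the actual custom_warmup_anneal implementation may differ, and iters_per_epoch is a hypothetical parameter introduced here.

```python
def sketch_learning_rate(iteration, iters_per_epoch, base_lr,
                         warmup, anneal_coeff, anneal_every_epoch):
    """Hypothetical schedule: linear warmup, then step decay by epoch."""
    if iteration < warmup:
        # Ramp linearly up to base_lr during the warmup iterations.
        return base_lr * (iteration + 1) / warmup
    epoch = iteration // iters_per_epoch
    # Apply one decay factor for every completed block of epochs.
    return base_lr * anneal_coeff ** (epoch // anneal_every_epoch)


# Flag values from the command above; 12000 iterations per epoch is made up.
for it in (0, 9999, 12000, 24000, 48000):
    lr = sketch_learning_rate(it, 12000, 2e-4, 10000, 0.8, 2)
    print(it, lr)
```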

3. Cross-Entropy Training: End to End Training

The following command trains the entire network in end to end mode. One argument must be changed according to the previous result: the checkpoint file name. Weights are stored in the directory github_ignore_material/saves/ with a name of the form checkpoint_ ... _xe.pth; we refer to this file as phase2_checkpoint below and in the later steps:

python train.py --N_enc 3 --N_dec 3  \
    --model_dim 512 --optim_type radam --seed 775533   --sched_type custom_warmup_anneal  \
    --warmup 1 --lr 3e-5 --anneal_coeff 0.55 --anneal_every_epoch 1 --enc_drop 0.3 \
    --dec_drop 0.3 --enc_input_drop 0.3 --dec_input_drop 0.3 --drop_other 0.3  \
    --batch_size 16 --num_accum 3 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [3]  \
    --save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1  \
    --is_end_to_end True --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ --partial_load True \
    --backbone_save_path ./github_ignore_material/raw_data/swin_large_patch4_window12_384_22k.pth \
    --body_save_path ./github_ignore_material/saves/phase2_checkpoint \
    --print_every_iter 15000 --eval_every_iter 999999 \
    --reinforce False --num_epochs 2 &> output_file.txt &

If you are interested in the network's weights at the end of this stage, before moving to self-critical learning, rename the checkpoint file from checkpoint_ ... _xe.pth to something else, such as phase3_checkpoint (make sure to change the prefix); otherwise it will be overwritten during step 5.
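
The preservation step can be scripted. The sketch below looks for files matching checkpoint_*_xe.pth in the saves directory and copies the most recently modified one under a new name. It copies rather than renames, so the original stays in place until training overwrites it. The function name is hypothetical, and the wildcard stands in for the part of the file name that this guide leaves unspecified.

```python
import shutil
from pathlib import Path


def preserve_checkpoint(saves_dir, new_name):
    """Copy the newest checkpoint_*_xe.pth in saves_dir to saves_dir/new_name."""
    saves = Path(saves_dir)
    candidates = sorted(
        saves.glob("checkpoint_*_xe.pth"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no checkpoint_*_xe.pth file in {saves}")
    target = saves / new_name
    shutil.copy2(candidates[-1], target)
    return target


# Usage:
#   preserve_checkpoint("./github_ignore_material/saves/", "phase3_checkpoint")
```

Because the new name does not start with checkpoint_, it no longer matches the pattern that later phases write to.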

4. CIDEr optimization: Features generation

This step generates the features for the reinforcement step:

python data_generator.py \
    --save_model_path ./github_ignore_material/saves/phase3_checkpoint \
    --output_path ./github_ignore_material/raw_data/features.hdf5 \
    --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ \
    --captions_path ./github_ignore_material/raw_data/ &> output_file.txt &

5. CIDEr optimization: Partial Training

The following command performs the partial training using the self-critical learning:

python train.py --N_enc 3 --N_dec 3  \
    --model_dim 512 --optim_type radam --seed 775533  --sched_type custom_warmup_anneal  \
    --warmup 1 --lr 1e-4 --anneal_coeff 0.8 --anneal_every_epoch 1 --enc_drop 0.1 \
    --dec_drop 0.1 --enc_input_drop 0.1 --dec_input_drop 0.1 --drop_other 0.1  \
    --batch_size 24 --num_accum 2 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [5]  \
    --save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1  \
    --is_end_to_end False --partial_load True \
    --features_path ./github_ignore_material/raw_data/features.hdf5 \
    --body_save_path ./github_ignore_material/saves/phase3_checkpoint.pth \
    --print_every_iter 4000 --eval_every_iter 99999 \
    --reinforce True --num_epochs 9 &> output_file.txt &

We refer to the last checkpoint produced in this step as phase5_checkpoint. It should already achieve around 139.5 CIDEr-D on both the validation and test sets, but it can still be improved by a small margin with the following optional step.

6. CIDEr optimization: End to End Training

This last step again trains the model in an end to end fashion. It is optional, since it only slightly improves performance:

python train.py --N_enc 3 --N_dec 3  \
    --model_dim 512 --optim_type radam --seed 775533 --sched_type custom_warmup_anneal  \
    --warmup 1 --anneal_coeff 1.0 --lr 2e-6 --enc_drop 0.1 \
    --dec_drop 0.1 --enc_input_drop 0.1 --dec_input_drop 0.1 --drop_other 0.1  \
    --batch_size 24 --num_accum 2 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [5]  \
    --save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1  \
    --is_end_to_end True --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ --partial_load True \
    --backbone_save_path ./github_ignore_material/raw_data/phase3_checkpoint \
    --body_save_path ./github_ignore_material/saves/phase5_checkpoint \
    --print_every_iter 15000 --eval_every_iter 999999 \
    --reinforce True --num_epochs 1 &> output_file.txt &


In this section we provide the evaluation scripts. We refer to the last checkpoint as phase6_checkpoint. If the previous training procedures have been skipped, the weights of one of the ensemble's models can be found here.

python test.py --N_enc 3 --N_dec 3 --model_dim 512 \
    --num_gpus 1 --eval_beam_sizes [5] --is_end_to_end True \
    --eval_parallel_batch_size 4 \
    --images_path ./github_ignore_material/raw_data/<your_coco_img_folder> \
    --save_model_path ./github_ignore_material/saves/phase6_checkpoint

The option is_end_to_end can be toggled according to the model's type.
You may need to grant execute permission to the file ./eval/get_stanford_models.sh (e.g. chmod a+x -R ./eval/ on Linux).


If you find this repository useful, please consider citing (no obligation):

  title={Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning},
  author={Hu, Jia Cheng and Cavicchioli, Roberto and Capotondi, Alessandro},
  booktitle={2023 IEEE International Conference on Big Data (BigData)},
  organization={IEEE Computer Society}


We thank the PyTorch team and the following repositories:

special thanks to the work of Yiyu Wang et al.

We thank the user @shahizat for the suggestion of ONNX and TensorRT conversions.
We also thank the GitHub users from the Issues section who provided valuable feedback and suggestions, and even found very insidious bugs.