PyTorch implementation for the IEEE Access paper: EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning
Fig. 1. An overview of our proposed EAES embedding module, which includes an objects-augmented module for augmenting spatial location information between objects and OCR tokens, and a grid features augmentation module for augmenting the global context features of the image.
Clone this repository and build it with the following commands:
cd EAES_m4c
pip install -r requirements.txt
python setup.py build develop
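A quick sanity check after building, as a minimal sketch; the `pythia` import is an assumption based on the M4C-Captioner codebase this repository extends, so adjust it if the package name differs:

```python
# Sanity-check the environment after `python setup.py build develop`.
import torch
import pythia  # assumed package name, inherited from the M4C-Captioner/Pythia codebase

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```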
- The TextCaps dataset can be downloaded from the official site.
- Grid features can be extracted from images using the official repo at https://github.com/facebookresearch/grid-feats-vqa.
- BUTD features for visual objects and OCR tokens can be found at the baseline M4C-Captioner repo.
- RelDN and VinVL features can be extracted from images using the Scene Graph Benchmark toolbox, available at https://github.com/microsoft/scene_graph_benchmark.
However, we have already extracted these features for this study, and anyone can access them via the links below:
| # | Type of feature | Features | Imdb file |
|---|---|---|---|
| 1 | BUTD | updating | updating |
| 2 | RelDN | updating | updating |
| 3 | VinVL | updating | updating |
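Once downloaded, the imdb files are plain NumPy pickle files, so they can be inspected directly. A minimal sketch, assuming the entries follow the M4C-Captioner imdb layout (the first element is a metadata header and the rest are per-caption samples; the exact field names may differ):

```python
import numpy as np

# Load a TextCaps imdb file (a NumPy object array of dicts).
imdb = np.load("imdb_train.npy", allow_pickle=True)

# imdb[0] is typically a metadata header; imdb[1:] are the samples.
print("Number of entries:", len(imdb))
print("Sample keys:", sorted(imdb[1].keys()))  # e.g. image_id, caption_str, ocr_tokens, ...
```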
Run the following command:
CUDA_VISIBLE_DEVICES=[your gpu ids] python tools/run.py \
--tasks captioning \
--datasets m4c_textcaps \
--model m4c_captioner \
--config configs/captioning/m4c_textcaps/EAES_m4c_VinVLx154c4.yml \
--save_dir [save_dir]
Modify `save_dir` to the path where you want the trained model to be saved. Note that in the `m4c_captioner.yml` file you should set the correct feature paths, as in the following snippet:
dataset_attributes:
m4c_textcaps:
image_features:
test:
- [image test feature path], [ocr test feature path]
train:
- [image train feature path], [ocr train feature path]
val:
- [image train feature path], [ocr train feature path]
imdb_files:
test:
- [your path]/imdb_test_filtered_by_image_id.npy
train:
- [your path]/imdb_train.npy
val:
- [your path]/imdb_val_filtered_by_image_id.npy
Replace `image test feature path`, `image train feature path`, `ocr test feature path`, and `ocr train feature path` with the paths to the visual object and OCR token features you downloaded in the Materials section.
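Before launching training, it can be useful to verify that every configured path actually exists. The helper below is a hypothetical sketch: it assumes PyYAML is installed, that the config can be loaded standalone, and that the paths are absolute or resolved relative to the current directory (in the Pythia/M4C codebase they are normally resolved relative to a data root, so prepend it if needed):

```python
import os
import yaml

# Hypothetical path-checking helper for the dataset_attributes section shown above.
CONFIG = "configs/captioning/m4c_textcaps/EAES_m4c_VinVLx154c4.yml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

dataset = cfg["dataset_attributes"]["m4c_textcaps"]
for section in ("image_features", "imdb_files"):
    for split, entries in dataset[section].items():
        for entry in entries:
            # image_features entries may contain several comma-separated paths.
            for path in str(entry).split(","):
                path = path.strip()
                status = "OK" if os.path.exists(path) else "MISSING"
                print(f"[{status}] {section}/{split}: {path}")
```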
The inference process can easily be done by running the following command:
CUDA_VISIBLE_DEVICES=[your gpu ids] python tools/run.py \
--tasks captioning \
--datasets m4c_textcaps \
--model m4c_captioner \
--config configs/captioning/m4c_textcaps/EAES_m4c_VinVLx154c4.yml \
--save_dir [your save path] \
--resume_file [your final checkpoint path] \
--evalai_inference 1 \
--run_type inference
The final checkpoint should be named `m4c_captioner_final.pth` and is stored at `[your save path]/m4c_textcaps_m4c_captioner/`.
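With `--evalai_inference 1`, the run writes the predicted captions as an EvalAI-style JSON report under the save directory (as in the M4C-Captioner baseline; the exact filename depends on the run). A short sketch to inspect it, assuming each entry carries an `image_id` and a `caption` field:

```python
import json

# Replace the placeholder with the JSON report generated under your save directory.
with open("[path to the generated JSON report]") as f:
    predictions = json.load(f)

print("Number of predictions:", len(predictions))
print("Sample prediction:", predictions[0])  # expected fields: image_id, caption
```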
If you use any part of our code in your research, please cite our paper:
@ARTICLE{9732974,
author={Nguyen, Khang and Bui, Doanh C. and Trinh, Truc and Vo, Nguyen D.},
journal={IEEE Access},
title={EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning},
year={2022},
volume={10},
number={},
pages={32443-32452},
doi={10.1109/ACCESS.2022.3158763}}
This work was supported by the Multimedia Processing Lab (MMLab) and UIT-Together research group at the University of Information Technology, VNUHCM.
The code is greatly inspired by the M4C-Captioner.