vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

This GitHub repo contains our solution for the vieCap4H Challenge 2021. In detail, we use grid features as the visual representation and pre-train a BERT-based language model, initialized from the pre-trained PhoBERT-base model, to obtain the language representation. We also identify a suitable training schedule for the self-critical sequence training (SCST) technique to achieve the best results. In our experiments, we achieve an average BLEU score of 30.3% on the public-test round and 28.9% on the private-test round, ranking 3rd and 4th, respectively.

Figure 1. An overview of our solution based on RSTNet

0. Requirements

Install the required libraries with the following command:

pip install -r requirements.txt

1. Data preparation

The grid features of vieCap4H can be downloaded via the links below:

The dataset can be downloaded at:

Annotations must be converted to COCO format. We have already converted them, and the converted annotations are available at:
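
For readers who want to redo the conversion themselves, the following is a minimal sketch of it, not the script we used. It assumes the original annotations are a flat JSON list of records shaped like {"id": <image file name>, "captions": <caption text>}; that layout, and the names to_coco_captions and convert_file, are assumptions made for illustration. The output keeps only the "images" and "annotations" keys of the COCO captions format.

```python
import json
from pathlib import Path


def to_coco_captions(records):
    """Convert a flat list of {"id": file_name, "captions": text} records
    (an assumed input layout) into a COCO-captions style dictionary."""
    images = []
    annotations = []
    image_ids = {}  # file name -> integer image id

    for ann_id, record in enumerate(records):
        file_name = record["id"]
        # Register each image once, even if it has several captions.
        if file_name not in image_ids:
            image_ids[file_name] = len(image_ids)
            images.append({"id": image_ids[file_name], "file_name": file_name})
        annotations.append({
            "id": ann_id,
            "image_id": image_ids[file_name],
            "caption": record["captions"],
        })

    return {"images": images, "annotations": annotations}


def convert_file(src_path, dst_path):
    """Read the assumed flat JSON file and write the COCO-style JSON file."""
    records = json.loads(Path(src_path).read_text(encoding="utf-8"))
    coco = to_coco_captions(records)
    # ensure_ascii=False keeps Vietnamese diacritics readable in the output.
    Path(dst_path).write_text(json.dumps(coco, ensure_ascii=False), encoding="utf-8")
```

If the organizers' file uses different key names, only the two record lookups inside to_coco_captions need to change.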

2. Training

The VnCoreNLP service must be running before training. Start it with the following command:

vncorenlp -Xmx2g data/VnCoreNLP-1.1.1.jar -p 9000 -a "wseg,pos,ner,parse"
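
Before launching a long training run, it can help to confirm that something is listening on the port given above. The sketch below only tests whether a TCP connection to port 9000 succeeds; it does not verify that the listener is VnCoreNLP. The helper name is_server_reachable and the host 127.0.0.1 are assumptions for illustration.

```python
import socket


def is_server_reachable(host="127.0.0.1", port=9000, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as is_server_reachable(port=9000) returning False suggests the server has not finished starting or was launched on a different port.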

Pre-train the BERT-based language model from PhoBERT-base:

python \
--img_path <images path> \
--features_path <features path> \
--annotation_folder <annotations folder> \
--batch_size 40

The weights of the BERT-based model should appear in the folder saved_language_models.

Then continue by training the Transformer model with the command below:

python \
--img_path <images path> \
--features_path <features path> \
--annotation_folder <annotations folder> \
--batch_size 40

The weights of the Transformer-based model should appear in the folder saved_transformer_rstnet_models.

Here, <images path> is the data folder, <features path> is the path to the grid features folder, and <annotations folder> is the path to the folder that contains the file viecap4h-public-train.json.
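
The three paths described above can be checked before training starts. This is a small optional sketch, not part of the repo; the function name check_training_inputs is hypothetical, and the only file it looks for is viecap4h-public-train.json, as stated above.

```python
from pathlib import Path


def check_training_inputs(img_path, features_path, annotation_folder,
                          train_file="viecap4h-public-train.json"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    folders = [
        ("images path", img_path),
        ("features path", features_path),
        ("annotations folder", annotation_folder),
    ]
    for label, folder in folders:
        if not Path(folder).is_dir():
            problems.append(f"{label} is not a directory: {folder}")

    # The annotations folder must contain the training annotation file.
    train_json = Path(annotation_folder) / train_file
    if not train_json.is_file():
        problems.append(f"missing annotation file: {train_json}")
    return problems
```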

3. Inference

The results can be obtained with the command below:


4. Reproduction

To reproduce our leaderboard results, download the two pretrained models (the BERT-based language model and the Transformer model) via the links below:

We also provide the vocabulary file used for training and a sample submission file, which is used to arrange the predicted captions in the order expected by the organizers.
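
To illustrate what arranging the predictions means, here is a minimal sketch that reorders predicted captions to follow the sample submission. It is not the repo's code: it assumes the sample submission is a JSON list of {"id": ..., "captions": ...} entries and that predictions are held in a dictionary keyed by image id. Both layouts, and the name arrange_like_sample, are assumptions.

```python
def arrange_like_sample(sample_submission, predictions):
    """Return predictions as a list ordered like the sample submission.

    sample_submission: list of {"id": ..., "captions": ...} (assumed layout)
    predictions: dict mapping image id -> predicted caption
    """
    missing = [entry["id"] for entry in sample_submission
               if entry["id"] not in predictions]
    if missing:
        # Fail early rather than submit a file with gaps.
        raise KeyError(f"no prediction for {len(missing)} image(s), e.g. {missing[0]}")

    return [
        {"id": entry["id"], "captions": predictions[entry["id"]]}
        for entry in sample_submission
    ]
```

The result can then be written out with json.dump(..., ensure_ascii=False) so that Vietnamese characters are kept as-is.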

Then run the command below to reproduce the results:


Note: make sure the VnCoreNLP server is running when you execute this command.

If you use any part of the source code, please cite us:

	author = {Bui Doanh and Trinh Truc and Nguyen Thuan and Nguyen Vu and Nguyen Vo},
	title = {vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese},
	journal = {VNU Journal of Science: Computer Science and Communication Engineering},
	volume = {38},
	number = {2},
	year = {2022},
	keywords = {},
	abstract = {The automatic image caption generation is attractive to both Computer Vision and Natural Language Processing research community because it lies in the gap between these two fields. Within the vieCap4H contest organized by VLSP 2021, we participate and present a Transformer-based solution for image captioning in the healthcare domain. In detail, we use grid features as visual presentation and pre-training a BERT-based language model from PhoBERT-base pre-trained model to obtain language presentation used in the Adaptive Decoder module in the RSTNet model. Besides, we indicate a suitable schedule with the self-critical training sequence (SCST) technique to achieve the best results. Through experiments, we achieve an average of 30.3% BLEU score on the public-test round and 28.9% on the private-test round, which ranks 3rd and 4th, respectively. Source code is available at},
	issn = {2588-1086},
	doi = {10.25073/2588-1086/vnucsce.371},
	url = {//}