Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for image captioning, in which the publicly available COCO Captions dataset (Lin et al., 2014) is extended with information about the scene (such as the objects in the image). Since this information is in textual form, it can be used to bring NLP techniques, such as text similarity or semantic relation methods, into captioning systems, either as part of an end-to-end training strategy or as a post-processing step.
This repository contains the implementation of the paper Visual Semantic Relatedness Dataset for Image Captioning.
- Overview
- Visual semantic with BERT
- Dataset
- Visual semantic with pre-trained model
- Evaluation
- Citation
We enrich COCO Captions with textual visual context information. We use ResNet152, CLIP, and Faster R-CNN to extract object information for each COCO Captions image. We apply three filtering approaches to ensure the quality of the dataset. (1) Threshold: filter out predictions for which the object classifier is not confident enough. (2) Semantic alignment: use semantic similarity to remove duplicated objects. (3) Semantic relatedness score as a soft label: to ensure that the visual context and the caption are strongly related, we use Sentence RoBERTa to assign a soft label via cosine similarity, and then apply a threshold to annotate the final label (if th ≥ 0.2, 0.3, 0.4 then [1,0]). Finally, to take advantage of the overlap between the visual context and the caption, and to extract global information from each visual context, we use BERT followed by a shallow CNN (Kim, 2014) to estimate the visual relatedness score.
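The soft-label step can be illustrated with a minimal sketch. It assumes one reading of the rule above: a caption and visual context pair is labeled 1 when its cosine score reaches the chosen threshold and 0 otherwise. The function names and the toy vectors are hypothetical stand-ins; in the actual pipeline the vectors would come from Sentence RoBERTa.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_u * norm_v)


def hard_label(soft_score, threshold=0.4):
    """Map a soft cosine score to 1 (related) or 0 (not related).

    Assumption: the label is 1 when the score is at or above the threshold.
    """
    return 1 if soft_score >= threshold else 0


# Toy stand-ins for sentence embeddings (hypothetical values).
caption_vec = [0.9, 0.1, 0.3]
visual_vec = [0.8, 0.2, 0.1]
unrelated_vec = [-0.2, 0.9, -0.1]

related_score = cosine_similarity(caption_vec, visual_vec)
unrelated_score = cosine_similarity(caption_vec, unrelated_vec)

# One hard label per threshold used for the released files.
labels = {th: hard_label(related_score, th) for th in (0.2, 0.3, 0.4)}
```

A higher threshold keeps fewer, more strongly related pairs as positives, which is why the dataset is released at several thresholds.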
For a quick start, please have a look at the project page and the demo.
VC1 | VC2 | VC3 | human annotated caption |
---|---|---|---|
cheeseburger | plate | hotdog | a plate with a hamburger fries and tomatoes |
bakery | dining table | website | a table having tea and a cake on it |
gown | groom | apron | its time to cut the cake at this couples wedding |
- Download Raw data with ID and Visual context -> original dataset with the related caption ID (train2014)
- Download Data with cosine score -> soft cosine label with th 0.2, 0.3, 0.4 and 0.5
- Download Overlapping visual with caption -> overlap between the visual context and the human annotated caption
- Download Dataset (tsv file) 0.0 -> raw data with hard label without cosine similarity, and with a threshold on the cosine similarity (the degree of relation between the visual context and the caption) = 0.2, 0.3, 0.4
- Download Dataset GenderBias -> man/woman replaced with person class label
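As an illustration of the GenderBias variant, the replacement can be sketched as below. This is an assumption about how such a substitution could be done, not the script used to build the released file: it maps only the whole words "man" and "woman" to "person", as the list item states, and the function name is hypothetical.

```python
import re

# Assumption: only the singular whole words "man" and "woman" are replaced.
# The \b word boundaries keep words such as "fireman" untouched.
_GENDER_WORD = re.compile(r"\b(?:man|woman)\b", flags=re.IGNORECASE)


def neutralize_gender(text):
    """Replace the whole words 'man' and 'woman' with 'person'."""
    return _GENDER_WORD.sub("person", text)


example = neutralize_gender("a woman and a man cut the cake")
```

The same function can be applied to whichever text field carries the gendered word, whether that is the caption or the visual context label.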
Fine-tune BERT on the created dataset.
- Tensorflow 1.15.0
- Python 3.6
conda create -n BERT_visual python=3.6 anaconda
conda activate BERT_visual
pip install tensorflow==1.15.0
pip install --upgrade tensorflow_hub==0.7.0
Download the BERT checkpoint uncased_L-12_H-768_A-12
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
git clone https://github.com/gaphex/bert_experimental/
The directory layout should look like this: BERT-CNN/uncased_L-12_H-768_A-12 and BERT-CNN/bert_experimental
Download dataset
wget https://www.dropbox.com/s/dh38xibtjpohbeg/train_all.zip
unzip train_all.zip
For training
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train', default='train.tsv', help='training data (tsv file)', type=str, required=False)
parser.add_argument('--num_bert_layer', default=12, help='number of tuned BERT layers', type=int, required=False)
parser.add_argument('--batch_size', default=128, help='batch size', type=int, required=False)
parser.add_argument('--epochs', default=5, help='number of training epochs', type=int, required=False)
parser.add_argument('--seq_len', default=64, help='maximum sequence length', type=int, required=False)
parser.add_argument('--CNN_kernel_size', default=3, help='CNN kernel size', type=int, required=False)
parser.add_argument('--CNN_filters', default=32, help='number of CNN filters', type=int, required=False)
python BERT_CNN.py --train /train_0.4.tsv --epochs 5
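To make the roles of `--CNN_kernel_size`, `--CNN_filters`, and `--seq_len` more concrete, here is a minimal pure-Python sketch of a Kim (2014) style layer: a 1D convolution over token vectors, a ReLU, and max-over-time pooling. It is only an illustration under stated assumptions. The function name, the random weights, and the reduced hidden size are hypothetical, and the layer arrangement in `BERT_CNN.py` may differ.

```python
import random


def conv1d_max_pool(token_vectors, filters, kernel_size=3):
    """Slide each filter over windows of `kernel_size` tokens, apply ReLU,
    and keep the largest response per filter (max-over-time pooling)."""
    seq_len = len(token_vectors)
    pooled = []
    for weights, bias in filters:
        best = 0.0  # ReLU output is never below zero
        for start in range(seq_len - kernel_size + 1):
            window = token_vectors[start:start + kernel_size]
            activation = bias + sum(
                w * x
                for w_row, x_row in zip(weights, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, activation)
        pooled.append(best)
    return pooled


random.seed(0)
# seq_len, n_filters, and kernel follow the script defaults; hidden would be
# 768 for BERT-base, and 8 is used here only to keep the demo small.
seq_len, hidden, n_filters, kernel = 64, 8, 32, 3

# Random stand-ins for BERT token embeddings and for learned filter weights.
tokens = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(seq_len)]
filters = [
    ([[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(kernel)], 0.0)
    for _ in range(n_filters)
]

features = conv1d_max_pool(tokens, filters, kernel)  # one value per filter
```

The result is a fixed-size vector with one entry per filter regardless of sequence length, which a final dense layer could map to a relatedness score.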
For inference only, download the pre-trained model
wget https://www.dropbox.com/s/ip7p0wiwkwvph5k/0.4_bert-cnn.zip
unzip 0.4_bert-cnn.zip
python eval.py
Although this approach is proposed to take advantage of the dataset (e.g. the visual semantic model), we also investigate the use of out-of-the-box tools to estimate the relatedness score between the short text (i.e. the caption) and its environmental visual context (which we call the visual classifier).
For this, we follow a similarity-to-probability based approach, but we use only the cosine similarity from a pre-trained model and the top-3 averaged probability (confidence) from the object classifier as:
where the main components of the visual semantics re-ranker:
with Pre-trained SBERT
python pre-trained/model.py --vis visual-context_label.txt --vis_prob visual-context_prob.txt --c caption.txt
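One ingredient of this score, the top-3 averaged confidence from the object classifier, can be sketched as follows. This covers only the averaging step and does not attempt to reproduce how the repository combines it with the cosine similarity. The function name and the sample confidences are hypothetical.

```python
def top_k_mean(probabilities, k=3):
    """Average the k largest classifier confidences.

    If fewer than k values are given, all of them are averaged.
    """
    if not probabilities:
        raise ValueError("at least one probability is required")
    top = sorted(probabilities, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical confidences for the visual contexts predicted for one image.
visual_confidence = top_k_mean([0.5, 0.9, 0.1, 0.7])
```

Averaging over the top three predictions smooths out a single overconfident label from the classifier.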
pip install pycocoevalcap
Then run
python Evaluation/coco_eval.py --f Result_tune_BERT_0.4.json
For future work, we plan to estimate the visual relatedness score by employing unsupervised learning (i.e. contrastive learning). (work in progress)
Feel free to download the training data
- Download CC -> Caption dataset from Conceptual Captions (CC) 2M (2255927 captions)
- Download CC+wiki -> CC+1M-wiki 3M (3255928)
- Download CC+wiki+COCO -> CC+wiki+COCO-Caption 3.5M (366984)
- Download COCO-caption+wiki -> COCO-caption +wiki 1.4M (1413915)
- Download COCO-caption+wiki+CC+8Mwiki -> COCO-caption+wiki+CC+8Mwiki 11M (11541667)
The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:
@inproceedings{XXXXXX,
title={ZZZZZZ},
author={XXXXX},
booktitle={XXXX},
year={2022}
}