Authors: Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng
This repository contains the official implementation of our paper: Transferable Decoding with Visual Entities for Zero-Shot Image Captioning.
- Introduction
- Examples
- Citation
- Required Prerequisites
- Data Preparation
- Training
- Evaluation
- Inference
- Acknowledgments
- Contact
This paper addresses the transferability of zero-shot captioning to out-of-domain images. As illustrated in the figure, pre-trained vision-language models and large language models are susceptible to the modality bias induced by the language model when adapted to image-to-text generation. These models also tend to generate descriptions containing objects that do not actually appear in the image but occurred frequently during training, a phenomenon known as object hallucination. We propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. This is the official repository for ViECap, where you can easily reproduce the paper's results and try the model on your own images.
Here are some examples of our model's captions across diverse scenarios!
CIDEr results on cross-domain (COCO ⇒ NoCaps, COCO ⇒ Flickr30k, Flickr30k ⇒ COCO) and in-domain (COCO, Flickr30k) captioning are presented here:

| Method | NoCaps In | NoCaps Near | NoCaps Out | NoCaps Overall | COCO ⇒ Flickr30k | Flickr30k ⇒ COCO | COCO | Flickr30k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MAGIC | ---- | ---- | ---- | ---- | 17.5 | 18.3 | 49.3 | 20.4 |
| DeCap | 65.2 | 47.8 | 25.8 | 45.9 | 35.7 | 44.4 | 91.2 | 56.7 |
| CapDec | 60.1 | 50.2 | 28.7 | 45.9 | 35.7 | 27.3 | 91.8 | 39.1 |
| ViECap | 61.1 | 64.3 | 65.0 | 66.2 | 38.4 | 54.2 | 92.9 | 47.9 |
If you find our paper and code helpful, we would greatly appreciate it if you could leave a star and cite our work. Thanks!
@InProceedings{Fei_2023_ICCV,
author = {Fei, Junjie and Wang, Teng and Zhang, Jinrui and He, Zhenyu and Wang, Chengjie and Zheng, Feng},
title = {Transferable Decoding with Visual Entities for Zero-Shot Image Captioning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {3136-3146}
}
@article{fei2023transferable,
title={Transferable Decoding with Visual Entities for Zero-Shot Image Captioning},
author={Fei, Junjie and Wang, Teng and Zhang, Jinrui and He, Zhenyu and Wang, Chengjie and Zheng, Feng},
journal={arXiv preprint arXiv:2307.16525},
year={2023}
}
For code execution, begin by cloning this repository and downloading the annotations, checkpoints, and evaluation files from the Releases of this repository. Afterward, unzip the files and place them in the root directory. Note that the code has only been tested on Linux.
git clone git@github.com:FeiElysia/ViECap.git
To use this code with your own dataset, first convert the dataset into the expected format via preprocessing. Extract the entities from each caption with the following command (make sure you have gathered all captions from the dataset into a list):
python entities_extraction.py
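To illustrate what this preprocessing step produces, here is a minimal sketch (not the repository's `entities_extraction.py`, which uses a proper NLP pipeline): it pulls candidate entity words out of each caption by matching tokens against a small, hypothetical object vocabulary.

```python
# Illustrative sketch only -- NOT the repository script. It approximates entity
# extraction by matching lowercased caption tokens against a toy vocabulary.
from typing import List

VOCAB = {"girl", "bed", "dog", "frisbee", "pajamas"}  # assumed toy vocabulary

def extract_entities(caption: str, vocab=VOCAB) -> List[str]:
    """Return vocabulary words found in the caption, in order, without duplicates."""
    seen: List[str] = []
    for token in caption.lower().replace(".", "").split():
        if token in vocab and token not in seen:
            seen.append(token)
    return seen

captions = ["A little girl in pink pajamas sitting on a bed."]
print([extract_entities(c) for c in captions])  # [['girl', 'pajamas', 'bed']]
```

The real script extracts nouns with an NLP toolkit rather than a fixed word list, but the output shape is the same: a list of entity words per caption.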
(Optional) You can pre-extract the features of the training captions:
python texts_features_extraction.py
Using these two scripts, you can now transform any dataset you wish to use for training into the appropriate data format for the dataloader. Additionally, we have made the processed COCO dataset and Flickr30k dataset available in the Releases, feel free to use them directly!
To evaluate the trained ViECap, you should first construct the vocabulary and extract the embeddings of each category in the vocabulary. Utilize the vocabulary provided in the Releases and execute the following script (we also supply the extracted vocabulary embeddings here):
python generating_prompt_ensemble.py
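Prompt ensembling embeds each vocabulary category under several prompt templates and averages the resulting text features. The sketch below shows the averaging-and-renormalization pattern; `encode_text` is a deterministic stand-in for the CLIP text encoder (the templates and dimension are assumptions for illustration).

```python
# Minimal sketch of prompt ensembling. encode_text() is a placeholder for the
# CLIP text encoder -- it produces a deterministic pseudo-embedding so the
# averaging logic can run stand-alone.
import zlib
import numpy as np

TEMPLATES = ["a photo of a {}.", "a picture of a {}.", "an image of a {}."]

def encode_text(prompt: str) -> np.ndarray:
    seed = zlib.crc32(prompt.encode())  # stable per-prompt seed
    return np.random.default_rng(seed).standard_normal(512)

def ensemble_embedding(category: str) -> np.ndarray:
    feats = np.stack([encode_text(t.format(category)) for t in TEMPLATES])
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize each template
    mean = feats.mean(axis=0)                              # average across templates
    return mean / np.linalg.norm(mean)                     # renormalize the ensemble

emb = ensemble_embedding("dog")
print(emb.shape)  # (512,)
```

Averaging over multiple templates smooths out the wording sensitivity of any single prompt, which is why the extracted vocabulary embeddings are built this way.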
(Optional) You can also extract the image features beforehand for evaluation. Make sure to modify the script if you want to adapt it to your own dataset.
Note that if you choose not to use the image features we provide, you should download the source images for the COCO and Flickr30k datasets from their official websites, then place them in the 'ViECap/annotations/coco/val2014' directory for COCO and the 'ViECap/annotations/flickr30k/flickr30k-images' directory for Flickr30k.
python images_features_extraction.py
To train ViECap on the COCO dataset or the Flickr30k dataset, run the corresponding script (`bash train_*.sh n`):
bash train_coco.sh 0
bash train_flickr30k.sh 0
where `n` is the ID of the GPU to use (i.e., `cuda:n`).
Now you can evaluate the captioning performance of your trained model on the test set with `bash eval_*.sh EXP_NAME n OTHER_ARGS m`, where `EXP_NAME` is the name of the directory storing the checkpoints, `OTHER_ARGS` passes any additional arguments, and `n` and `m` are the GPU ID and the epoch of the checkpoint to evaluate, respectively.
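The `OTHER_ARGS` used below include entity-filtering options such as `--top_k` and `--threshold`. As a rough sketch (names and exact scoring are assumptions, not the repository's implementation), such filtering keeps at most `top_k` candidate entities whose normalized image-text similarity clears the threshold:

```python
# Hypothetical sketch of top-k / threshold entity filtering over similarity
# scores (e.g., CLIP image-text similarities). Illustrative only.
import numpy as np

def filter_entities(entities, scores, top_k=3, threshold=0.2):
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over candidates
    order = np.argsort(probs)[::-1][:top_k]        # indices of the top-k candidates
    return [entities[i] for i in order if probs[i] >= threshold]

entities = ["dog", "frisbee", "cat", "car"]
scores = np.array([2.0, 1.5, 0.3, -1.0])
print(filter_entities(entities, scores))  # ['dog', 'frisbee']
```

Raising `--threshold` or lowering `--top_k` makes the hard prompt more conservative, trading recall of rare entities for fewer hallucinated ones.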
To evaluate the cross-domain captioning performance from COCO to NoCaps, run the following script:
bash eval_nocaps.sh train_coco 0 '--top_k 3 --threshold 0.2' 14
| Method | In CIDEr | In SPICE | Near CIDEr | Near SPICE | Out CIDEr | Out SPICE | Overall CIDEr | Overall SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeCap | 65.2 | ---- | 47.8 | ---- | 25.8 | ---- | 45.9 | ---- |
| CapDec | 60.1 | 10.2 | 50.2 | 9.3 | 28.7 | 6.0 | 45.9 | 8.3 |
| ViECap | 61.1 | 10.4 | 64.3 | 9.9 | 65.0 | 8.6 | 66.2 | 9.5 |
To evaluate the cross-domain captioning performance from COCO to Flickr30k, run the following script:
bash eval_flickr30k.sh train_coco 0 '--top_k 3 --threshold 0.2' 14
| Method | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| MAGIC | 6.2 | 12.2 | 17.5 | 5.9 |
| DeCap | 16.3 | 17.9 | 35.7 | 11.1 |
| CapDec | 17.3 | 18.6 | 35.7 | ---- |
| ViECap | 17.4 | 18.0 | 38.4 | 11.2 |
To evaluate the cross-domain captioning performance from Flickr30k to COCO, run the following script:
bash eval_coco.sh train_flickr30k 0 '--top_k 3 --threshold 0.2 --using_greedy_search' 29
| Method | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| MAGIC | 5.2 | 12.5 | 18.3 | 5.7 |
| DeCap | 12.1 | 18.0 | 44.4 | 10.9 |
| CapDec | 9.2 | 16.3 | 27.3 | ---- |
| ViECap | 12.6 | 19.3 | 54.2 | 12.5 |
To evaluate the in-domain captioning performance on the COCO testing set, run the following script:
bash eval_coco.sh train_coco 0 '' 14
| Method | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| ZeroCap | 7.0 | 15.4 | 34.5 | 9.2 |
| MAGIC | 12.9 | 17.4 | 49.3 | 11.3 |
| DeCap | 24.7 | 25.0 | 91.2 | 18.7 |
| CapDec | 26.4 | 25.1 | 91.8 | ---- |
| ViECap | 27.2 | 24.8 | 92.9 | 18.2 |
To evaluate the in-domain captioning performance on the Flickr30k testing set, run the following script:
bash eval_flickr30k.sh train_flickr30k 0 '' 29
| Method | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| ZeroCap | 5.4 | 11.8 | 16.8 | 6.2 |
| MAGIC | 6.4 | 13.1 | 20.4 | 7.1 |
| DeCap | 21.2 | 21.8 | 56.7 | 15.2 |
| CapDec | 17.7 | 20.0 | 39.1 | ---- |
| ViECap | 21.4 | 20.1 | 47.9 | 13.6 |
For FlickrStyle10K, you can follow the same steps described above. Begin by downloading the dataset!
We have provided the captioning results in the Releases. You can evaluate them directly using `bash language_eval.sh </path>`.
For example, if you wish to assess the cross-domain captioning performance from COCO to NoCaps, execute the following commands:
bash language_eval.sh ../checkpoints/train_coco/indomain_generated_captions.json
bash language_eval.sh ../checkpoints/train_coco/neardomain_generated_captions.json
bash language_eval.sh ../checkpoints/train_coco/outdomain_generated_captions.json
bash language_eval.sh ../checkpoints/train_coco/overall_generated_captions.json
You can generate a caption for any image with the following script:
python infer_by_instance.py --prompt_ensemble --using_hard_prompt --soft_prompt_first --image_path ./images/instance1.jpg
The generated caption is: A little girl in pink pajamas sitting on a bed.
Change `--image_path` to specify the path of any image you want to describe!
(Optional) You can also run the following script to generate captions for all images within a directory:
python infer_by_batch.py --prompt_ensemble --using_hard_prompt --soft_prompt_first --image_path ./images
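The `--using_hard_prompt` flag injects the detected entities into a natural-language prefix for the decoder. The sketch below shows one way such a hard prompt could be assembled; the exact template is an assumption for illustration, not copied from the repository.

```python
# Illustrative sketch: build a hard prompt from detected entities.
# The "There are ... in the image." template is an assumed example format.
from typing import List

def build_hard_prompt(entities: List[str]) -> str:
    if len(entities) == 1:
        listed = entities[0]
    else:
        listed = ", ".join(entities[:-1]) + " and " + entities[-1]
    return f"There are {listed} in the image."

print(build_hard_prompt(["girl", "bed"]))  # There are girl and bed in the image.
```

With `--soft_prompt_first`, this entity-bearing text is placed after the learned soft prompt, so the decoder is conditioned on both the projected image feature and the explicit entity names.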
Our repository builds on the CLIP, ClipCap, CapDec, MAGIC, and pycocotools repositories. Thanks for open-sourcing!
If you have any questions, please feel free to contact me at: junjiefei@outlook.com.