[CVPR 2022] This repository is for the paper "DIFNet: Boosting Visual Information Flow for Image Captioning".

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

DIFNet: A PyTorch Implementation

This repository contains the official code for our paper DIFNet: Boosting Visual Information Flow for Image Captioning (CVPR 2022).

If you find our work helpful or inspiring, please star this project and cite our paper. Thank you!

@inproceedings{wu2022difnet,
  title={DIFNet: Boosting Visual Information Flow for Image Captioning},
  author={Wu, Mingrui and Zhang, Xuying and Sun, Xiaoshuai and Zhou, Yiyi and Chen, Chao and Gu, Jiaxin and Sun, Xing and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18020--18029},
  year={2022}
}

Installation

Clone the repository and create the difnet conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate difnet

Then download the spaCy data by executing the following command:

python -m spacy download en

Then add the evaluation module from evaluation.

Note: Python 3.6+ and PyTorch 1.6+ are required to run our code.
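As a quick sanity check of the version requirement above, a minimal sketch along these lines may help. The helper names `parse_version` and `meets_minimum` are illustrative and not part of this repository; the script only compares version numbers and does not validate the rest of the environment.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.6.0+cu101' into a tuple of ints."""
    core = text.split("+")[0]  # drop local build tags such as '+cu101'
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    python_version = "{}.{}".format(*sys.version_info[:2])
    print("Python", python_version, "ok:", meets_minimum(python_version, "3.6"))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch", torch.__version__, "ok:",
              meets_minimum(torch.__version__, "1.6"))
```

Run it inside the activated difnet environment; both lines should report `ok: True` if the requirement is met.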

Data preparation

To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file annotations.zip and extract it.

Detection features are computed based on the project grid-feats-vqa. To reproduce our results, please extract the raw COCO grid features and process them according to the project RSTNet. Alternatively, for convenience, you can download the processed image features coco_grid_feats (extraction code: cvpr).

Segmentation features are computed with the code provided by UPSNet. To reproduce our results, please download the segmentation features file segmentations.zip (~83 MB) and extract it.
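Before launching training or evaluation, it can be useful to confirm that the downloaded data is where you expect it. The following is a minimal sketch, assuming hypothetical local paths; the names in `expected` are placeholders and should be replaced with the locations where you actually extracted the files.

```python
import os


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    # Hypothetical locations: adjust to your own directory layout.
    expected = [
        "annotations",      # folder extracted from annotations.zip
        "coco_grid_feats",  # processed grid features (placeholder name)
        "segmentations",    # folder extracted from segmentations.zip
    ]
    missing = missing_paths(expected)
    if missing:
        for path in missing:
            print("missing:", path)
    else:
        print("all expected data paths are present")
```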

Evaluation

To reproduce the results reported in our paper, download the pretrained model file DIFNet_lrp.pth and place it in the saved_transformer_models folder.

Run sh test.sh using the following arguments:

| Argument | Possible values |
| --- | --- |
| --exp_name | Experiment name |
| --mode | Model mode, one of 'base', 'base_lrp', 'difnet', 'difnet_lrp' |
| --batch_size | Batch size (default: 10) |
| --workers | Number of workers (default: 0) |
| --features_path | Path to detection features file |
| --pixel_path | Path to segmentation features file |
| --annotation_folder | Path to folder with COCO annotations |
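To make the evaluation arguments and their defaults concrete, here is a minimal `argparse` sketch that mirrors the list above. It is an illustration only: the function name `build_eval_parser` is hypothetical, and the repository's own scripts may declare these options differently.

```python
import argparse

# The four model modes listed for --mode.
MODES = ["base", "base_lrp", "difnet", "difnet_lrp"]


def build_eval_parser():
    """Build a parser with the evaluation flags and defaults listed above."""
    parser = argparse.ArgumentParser(
        description="Evaluation options (illustrative sketch)")
    parser.add_argument("--exp_name", type=str, help="Experiment name")
    parser.add_argument("--mode", type=str, choices=MODES, help="Model mode")
    parser.add_argument("--batch_size", type=int, default=10, help="Batch size")
    parser.add_argument("--workers", type=int, default=0,
                        help="Number of workers")
    parser.add_argument("--features_path", type=str,
                        help="Path to detection features file")
    parser.add_argument("--pixel_path", type=str, help="Path to pixel file")
    parser.add_argument("--annotation_folder", type=str,
                        help="Path to folder with COCO annotations")
    return parser


if __name__ == "__main__":
    # Parse an explicit list so the example does not depend on sys.argv.
    args = build_eval_parser().parse_args(
        ["--exp_name", "demo", "--mode", "difnet_lrp"])
    print(args)
```

Using `choices` means an unsupported `--mode` value is rejected at parse time rather than failing later during model construction.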

Expected output

The expected output of the evaluation code can be found under output_logs/.

Training procedure

Run python train.py using the following arguments:

| Argument | Possible values |
| --- | --- |
| --exp_name | Experiment name |
| --mode | Model mode, one of 'base', 'base_lrp', 'difnet', 'difnet_lrp' |
| --batch_size | Batch size (default: 50) |
| --workers | Number of workers (default: 4) |
| --head | Number of heads (default: 8) |
| --warmup | Warmup value for learning rate scheduling (default: 10000) |
| --resume_last | If used, training is resumed from the last checkpoint |
| --resume_best | If used, training is resumed from the best checkpoint |
| --features_path | Path to detection features file |
| --pixel_path | Path to segmentation features file |
| --annotation_folder | Path to folder with COCO annotations |
| --logs_folder | Path to folder for TensorBoard logs (default: "tensorboard_logs") |

mode

base: baseline model
base_lrp: baseline model with LRP
difnet: DIFNet
difnet_lrp: DIFNet with LRP
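The `--warmup` argument above is described only as a warmup value for learning rate scheduling. As an illustration of how such a value is commonly used, the sketch below implements the warmup schedule from the original Transformer paper. This is an assumption, not a statement about this repository's scheduler; the function name `noam_lr` and the model width `d_model=512` are hypothetical.

```python
def noam_lr(step, d_model=512, warmup=10000):
    """Learning-rate factor that rises linearly for `warmup` steps, then
    decays proportionally to the inverse square root of the step.

    Assumed formula (Transformer paper), not confirmed for this repository:
        d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    """
    step = max(step, 1)  # avoid raising zero to a negative power
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


if __name__ == "__main__":
    for step in (1, 5000, 10000, 20000, 100000):
        print(step, noam_lr(step))
```

Under this assumed schedule the factor peaks exactly at `step == warmup`, so a larger warmup value gives a slower ramp and a lower peak.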

For example, to train our model with the parameters used in our experiments, use

sh train.sh

To test the model, use

sh test.sh

For LRP (first generate the caption.json file with generate_caption.py, then use lrp_total.py to generate the lrp_result.pkl file, and finally use show_lrp.py to display the result), use

sh lrp.sh

If the GPU cache cannot be released, use the following command (for example, nvidia0 releases GPU 0). Note that it forcibly kills every process that is using the device:

fuser -v /dev/nvidia0 |awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sh
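To clarify what the awk step in the command above does, the following sketch reproduces it in Python: every whitespace-separated field in the `fuser` output is turned into a `kill -9` command. The function name `kill_commands` and the sample PIDs are hypothetical. The sketch only builds and prints the commands, it does not execute them.

```python
def kill_commands(fuser_stdout):
    """Mirror the awk step: one 'kill -9 <field>' per whitespace-separated field."""
    commands = []
    for line in fuser_stdout.splitlines():
        for field in line.split():
            commands.append("kill -9 " + field)
    return commands


if __name__ == "__main__":
    # Hypothetical PIDs standing in for the output of `fuser /dev/nvidia0`.
    for command in kill_commands(" 1234  5678"):
        print(command)
```

Printing the commands first, and piping them to `sh` only after reviewing them, is a safer way to use the one-liner on a shared machine.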

Acknowledgements

This repo is based on M^2 Transformer, the-story-of-heads and Transformer-Explainability.