Our Paper VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
Clone the repository and create the visualgpt conda environment:
conda env create -f environment.yml
conda activate visualgpt
Then download the spaCy data:
python -m spacy download en
We provide the COCO dataset for download. Please download the annotations file annotations.zip and extract it.
Also download coco_detections.hdf5, in which the data is stored as <key, value> pairs,
where key is the image id and value is a tensor of shape (N, 2048). N is the number of detections.
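To check that the downloaded file matches this description, a minimal sketch along the following lines may help. It assumes the h5py package is installed and that the datasets sit at the top level of the file. The helper names `detection_shapes`, `has_expected_dim`, and `inspect_file` are illustrative and not part of this repository.

```python
from typing import List, Mapping, Tuple


def detection_shapes(store: Mapping, limit: int = 3) -> List[Tuple[str, tuple]]:
    """Return (key, shape) pairs for the first `limit` entries of a key -> tensor store."""
    shapes = []
    for index, key in enumerate(store):
        if index >= limit:
            break
        shapes.append((key, tuple(store[key].shape)))
    return shapes


def has_expected_dim(store: Mapping, feature_dim: int = 2048, limit: int = 3) -> bool:
    """Check that the inspected entries are 2-D with `feature_dim` columns, i.e. (N, 2048)."""
    return all(
        len(shape) == 2 and shape[1] == feature_dim
        for _, shape in detection_shapes(store, limit)
    )


def inspect_file(path: str = "coco_detections.hdf5") -> None:
    """Print the shapes of a few entries from the detections file."""
    import h5py  # third-party dependency, imported lazily (assumed to be installed)

    with h5py.File(path, "r") as features:
        for key, shape in detection_shapes(features):
            # N varies per image; the second dimension should be 2048.
            print(key, shape)
        print("feature dimension ok:", has_expected_dim(features))
```

Calling `inspect_file()` from the directory that contains coco_detections.hdf5 prints a few keys with their shapes and reports whether the feature dimension is 2048.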
Create the log folder:
mkdir logs
Then start the training:
python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
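As a rough guide to the flags in this command, the sketch below works out the effective batch size and the size of the training subset. It rests on two assumptions that this README does not confirm: that --gradient_accumulation_steps accumulates gradients over that many batches before each optimizer step, and that --train_percentage is a fraction of the training set (so 0.001 would mean 0.1%). The function names and the training set size of 100,000 are hypothetical and chosen only for illustration.

```python
import math


def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Number of examples that contribute to one optimizer step."""
    return batch_size * accumulation_steps


def subset_size(total_examples: int, fraction: float) -> int:
    """Number of examples kept when training on a fraction of the data (at least one)."""
    return max(1, round(total_examples * fraction))


def optimizer_steps_per_epoch(num_examples: int, batch_size: int, accumulation_steps: int) -> int:
    """Optimizer steps needed to see `num_examples` once."""
    batches = math.ceil(num_examples / batch_size)
    return math.ceil(batches / accumulation_steps)


# Values taken from the command: --batch_size 50, --gradient_accumulation_steps 2,
# --train_percentage 0.001. The dataset size below is a placeholder, not the real one.
examples = subset_size(100_000, 0.001)
print(effective_batch_size(50, 2))                  # 100
print(examples)                                     # 100
print(optimizer_steps_per_epoch(examples, 50, 2))   # 1
```

Under these assumptions, each optimizer step sees 100 examples, so very small training fractions leave only a handful of steps per epoch.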
This code uses resources from Meshed Memory Transformer and Transformers.
Please cite our paper using the following BibTeX:
@article{chen2021visualgpt,
title={VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining},
author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2102.10407},
year={2021}
}
@article{chen2021visualgpt,
title={VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning},
author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2102.10407},
year={2021}
}