[NeurIPS 2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Introduction

Aurora is an efficient PETL method used in multimodal large model fields. It uses mode approximation to further reduce the trainable parameters and promote the fusion of different modalities.

1. Comparison with other PETL methods

2. Overall architecture

Getting Started

Requirements

Python 3.8, PyTorch>=1.8.0, torchvision>=0.7.0, timm>=0.6.13, numpy>=1.21.0, transformers>=4.27.4 are required for the current codebase.

Datasets

1.Image-text Retrieval Task

COCO2014: download dataset through https://cocodataset.org/#download, you can use such Linux command [wget -c http://images.cocodataset.org/annotations/annotations_trainval2014.zip] to help you download directly.

Flickr30k: download dataset through https://shannon.cs.illinois.edu/DenotationGraph/data/index.html; or you can download through this link: https://pan.baidu.com/s/1r0RVUwctJsI0iNuVXHQ6kA, the password is hrf3.

2.Video-text Retrieval

MSRVTT: download the video dataset in https://www.mediafire.com/folder/h14iarbs62e7p/shared, and the corresponding annotation file in https://mega.nz/file/UnRnyb7A#es4XmqsLxl-B7MP0KAat9VibkH7J_qpKj9NcxLh8aHg.

DiDemo: download the dataset through this Github project https://github.com/jpthu17/EMCL.

3.Visual Question Answering Task

VQAv2: The COCO dataset can be downloaded through https://visualqa.org/download.html, and the additional VG dataset can be downloaded through this GitHub project https://github.com/jayleicn/ClipBERT.

VideoQA: The video dataset is come from MSRVTT, and the annotation file can be downloaded through this GitHub project https://github.com/jayleicn/ClipBERT.

Image-text Retrieval

Download COCO and Flickr30k datasets, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
To parameter-efficient finetune on MSCOCO/Flickr:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_{coco, flickr}.yaml --output_dir output/{coco, flickr}

To evaluate on MSCOCO/Flickr:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_{coco, flickr}.yaml --output_dir output/{coco, flickr} --evaluate

Visual Question Answering

Download VQAv2 dataset and Visual Genome dataset, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
To parameter-efficient finetune on VQAv2:

python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --config ./configs/vqa.yaml --output_dir $static_dir

To evaluate on VQAv2 (need to update the result file to the official server, the server website is [https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278]):

python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --config ./configs/vqa.yaml --output_dir $static_dir --evaluate

Video-text Retrieval and VideoQA

Download MSRVTT and DiDemo datasets, and set 'video_root' & 'ann_root' in configs/retrieval_{dataset}.yaml accordingly.
To parameter-efficient finetune on MSRVTT:

python -m torch.distributed.run --nproc_per_node=8 train_video_retrieval.py --config ./configs/retrieval_msrvtt.yaml --output_dir $static_dir

To parameter-efficient finetune on DiDemo:

python -m torch.distributed.run --nproc_per_node=8 train_video_retrieval.py --config ./configs/retrieval_didemo.yaml --output_dir $static_dir

To parameter-efficient finetune on VideoQA:

python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --config ./configs/videoqa.yaml --output_dir $static_dir

Acknowledgement

Our codebase is built based on BLIP, timm, and transformers. We thank the authors for the nicely organized code!

How To Cite Aurora

If you use this code in your research, please kindly cite the following paper:

@article{wang2023mode,
  title={Mode Approximation Makes Good Vision-Language Prompts},
  author={Wang, Haixin and Yang, Xinlong and Chang, Jianlong and Jin, Dian and Sun, Jinan and Zhang, Shikun and Luo, Xiao and Tian, Qi},
  journal={arXiv preprint arXiv:2305.08381},
  year={2023}
}

@inproceedings{wang2023parameter,
  title={Parameter-efficient Tuning of Large-scale Multimodal Foundation Model},
  author={Wang, Haixin and Yang, Xinlong and Chang, Jianlong and Jin, Dian and Sun, Jinan and Zhang, Shikun and Luo, Xiao and Tian, Qi},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}