[ICML 2024] Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning

MemVP

Official code for "Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning".

Environment

conda create -n memvp python=3.10
conda activate memvp
pip install -r requirements.txt
pip install -e .

TODO

  • Code of experiments on LLaMA.
  • Code of experiments on BART and T5.

Preparation

<your path>/
  |-- memvp
  |-- scripts
  |-- train.py
  |-- eval.py
  ......
  |-- data/
      |-- problem.json
      |-- pid_splits.json
      |-- captions.json
      |-- images
          |-- train          # ScienceQA train images
          |-- val            # ScienceQA val images
          |-- test           # ScienceQA test images
      |-- weights
          |-- tokenizer.model
          |-- 7B
              |-- params.json
              |-- consolidated.00.pth
          |-- 13B
              |-- params.json
              |-- consolidated.00.pth
              |-- consolidated.01.pth
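
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and it assumes that `weights` sits under `data/` and that the `7B` and `13B` directories sit directly under `data/weights` next to `tokenizer.model`.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
COMMON_PATHS = [
    "data/problem.json",
    "data/pid_splits.json",
    "data/captions.json",
    "data/images/train",
    "data/images/val",
    "data/images/test",
    "data/weights/tokenizer.model",
]

# Checkpoint files listed in the tree for each model size.
MODEL_FILES = {
    "7B": ["params.json", "consolidated.00.pth"],
    "13B": ["params.json", "consolidated.00.pth", "consolidated.01.pth"],
}


def missing_paths(root, model_size="7B"):
    """Return the expected relative paths that do not exist under `root`.

    Hypothetical helper: it only checks for existence and does not
    validate file contents.
    """
    root = Path(root)
    expected = list(COMMON_PATHS)
    expected += [
        f"data/weights/{model_size}/{name}" for name in MODEL_FILES[model_size]
    ]
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths("<your path>", "13B")` returns an empty list when every listed file and directory is present, and otherwise the relative paths that still need to be downloaded or moved.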

Fine-Tuning & Inference

# LLaMA-7B
bash scripts/finetuning_sqa_7b.sh
bash scripts/eval_sqa_7b.sh

# LLaMA-13B
bash scripts/finetuning_sqa_13b.sh
bash scripts/eval_sqa_13b.sh

Fine-tuning takes about 40 minutes for LLaMA-7B and about 1 hour for LLaMA-13B on 8 A800 (80 GB) GPUs.

Checkpoints

Acknowledgements

Citation

@article{jie2024memvp,
  title={Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning},
  author={Jie, Shibo and Tang, Yehui and Ding, Ning and Deng, Zhi-Hong and Han, Kai and Wang, Yunhe},
  journal={arXiv preprint arXiv:2405.05615},
  year={2024}
}