Vary

Official code implementation of Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models.

Haoran Wei*, Lingyu Kong*, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

Release

  • [2023/12/11] We released the online demo, have fun!
  • [2023/12/11] We released the code for Vary (training and inference)!

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. Their use is also restricted to uses that follow the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

Install

  1. Clone this repository and navigate to the Vary folder
git clone https://github.com/Ucas-HaoranWei/Vary.git
cd Vary
  2. Install Package
conda create -n vary python=3.10 -y
conda activate vary
pip install -e .
  3. Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
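
After the install steps, a quick import check can confirm that the environment is usable before moving on. The following is a minimal sketch, not part of the repository: it assumes that `flash_attn` is the import name of the flash-attn package and that `pip install -e .` exposes a top-level module called `vary` (an assumption based on the `vary/` paths used in the commands in this README).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; adjust them if your installation differs.
REQUIRED = ["flash_attn", "vary"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Run it inside the activated `vary` conda environment; any module it reports as missing points at the install step that needs to be repeated.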

Vary Weights

  • Due to download speed issues with Baiduyun, we have temporarily closed the download link. The weights will be reorganized and released again in the next few days.
  • Download CLIP-VIT-L from Hugging Face.

Demo

  1. Update the CLIP-VIT path in the code (/cache/vit-large-patch14/) to your local path.

python vary/demo/run_qwen_vary.py  --model-name  /vary/model/path/ --image-file /an/image/file.png
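
Before running the demo, it helps to know where the placeholder path from step 1 appears. The sketch below lists every line in the Python sources that contains it. The helper name `find_hardcoded_paths` is hypothetical, and the script assumes the sources live under a `vary/` directory at the repository root, as the demo command suggests.

```python
from pathlib import Path

# Placeholder path quoted in step 1 of the Demo section.
OLD_PATH = "/cache/vit-large-patch14/"


def find_hardcoded_paths(root, needle=OLD_PATH):
    """Return (file, line_number, line) for each .py line under `root` containing `needle`."""
    hits = []
    for py_file in sorted(Path(root).rglob("*.py")):
        text = py_file.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(py_file), lineno, line.strip()))
    return hits
```

Calling `find_hardcoded_paths("vary")` from the repository root returns the locations to edit by hand; the sketch deliberately does not rewrite any files.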

Train

  • We currently do not plan to open-source the intermediate weights.
  • However, we release the training code, so you can train on your own dataset as follows:
  1. For Vary-base (single machine; for multiple machines, prepare a host file)
deepspeed Vary/train/train_qwen_vary.py --deepspeed /Vary/zero_config/zero2.json \
            --model_name_or_path /Qwen-7B/path/ \
            --vision_tower /vit-large-patch14/path/ \
            --freeze_vision_tower True \
            --freeze_lm_model False \
            --vision_select_layer -2 \
            --use_im_start_end True \
            --bf16 True \
            --per_device_eval_batch_size 4 \
            --gradient_accumulation_steps 1 \
            --evaluation_strategy "no" \
            --save_strategy "steps" \
            --save_steps 5000 \
            --save_total_limit 1 \
            --weight_decay 0. \
            --warmup_ratio 0.03 \
            --lr_scheduler_type "cosine" \
            --logging_steps 1 \
            --tf32 True \
            --model_max_length 4096 \
            --gradient_checkpointing True \
            --dataloader_num_workers 4 \
            --report_to none \
            --per_device_train_batch_size 4 \
            --num_train_epochs 1 \
            --learning_rate 5e-5 \
            --datasets data_name1+data_name2+data_name3 \
            --output_dir /path/to/output/
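
For the multi-machine case mentioned above, the DeepSpeed launcher reads a host file with one line per machine in the form `<hostname> slots=<number of GPUs>`. The sketch below renders such a file. The function name `render_hostfile` and the hostnames `worker-1` and `worker-2` are hypothetical; substitute the hostnames of your own machines.

```python
def render_hostfile(slots_by_host):
    """Render DeepSpeed hostfile text: one '<hostname> slots=<num_gpus>' line per machine."""
    lines = []
    for host, slots in slots_by_host.items():
        if slots < 1:
            raise ValueError(f"{host}: slots must be at least 1, got {slots}")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


# Hypothetical two-node cluster with 8 GPUs per node.
hostfile_text = render_hostfile({"worker-1": 8, "worker-2": 8})
print(hostfile_text, end="")
```

Save the output to a file and pass it to the launcher with `deepspeed --hostfile=/path/to/hostfile ...` ahead of the training script. The machines listed are expected to be reachable from the launching node over passwordless SSH.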
  2. For Vary-tiny
deepspeed Vary/train/train_opt.py --deepspeed /Vary/zero_config/zero2.json \
            --model_name_or_path /opt125m/path/ \
            --conversation_version opt \
            --freeze_vision_tower False \
            --freeze_lm_model False \
            --use_im_start_end True \
            --bf16 True \
            --per_device_eval_batch_size 4 \
            --gradient_accumulation_steps 1 \
            --evaluation_strategy "no" \
            --save_strategy "steps" \
            --save_steps 5000 \
            --save_total_limit 1 \
            --weight_decay 0. \
            --warmup_ratio 0.03 \
            --lr_scheduler_type "cosine" \
            --logging_steps 1 \
            --tf32 True \
            --model_max_length 4096 \
            --gradient_checkpointing True \
            --dataloader_num_workers 4 \
            --report_to none \
            --per_device_train_batch_size 16 \
            --num_train_epochs 1 \
            --learning_rate 5e-5 \
            --datasets data_name1+data_name2+data_name3 \
            --output_dir /path/to/output/
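
Both commands pass several dataset names joined by `+` in `--datasets`. How the training code resolves these names is not described in this README. The sketch below only illustrates one plausible reading, assuming the string is split on `+` and each name is looked up in a registry of known datasets. `DATASET_REGISTRY`, its fields, and `parse_datasets` are hypothetical names introduced for illustration.

```python
# Hypothetical registry: real dataset names and paths depend on your own data setup.
DATASET_REGISTRY = {
    "data_name1": {"annotations": "/path/to/data1.json", "images": "/path/to/images1/"},
    "data_name2": {"annotations": "/path/to/data2.json", "images": "/path/to/images2/"},
    "data_name3": {"annotations": "/path/to/data3.json", "images": "/path/to/images3/"},
}


def parse_datasets(arg, registry):
    """Split a '+'-joined dataset argument and check every name against `registry`."""
    names = [name.strip() for name in arg.split("+") if name.strip()]
    unknown = [name for name in names if name not in registry]
    if unknown:
        raise KeyError(f"Unknown dataset name(s): {', '.join(unknown)}")
    return names


selected = parse_datasets("data_name1+data_name2+data_name3", DATASET_REGISTRY)
print(selected)
```

Validating the names up front, as this sketch does, surfaces a misspelled dataset before a long training job starts rather than partway through data loading.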

Contact

If you have any questions related to the code or the paper, feel free to email (weihaoran18@mails.ucas.ac.cn).

Acknowledgement

  • LLaVA: the codebase we built upon!
  • Qwen: the LLM base model of Vary, which is good at both English and Chinese!

Citation

If you find our work useful in your research, please consider citing Vary:

@article{wei2023vary,
  author = {Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  title = {Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  journal = {arXiv preprint arXiv:2312.06109},
  year = {2023},
}