Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By fully exploiting the way LLMs understand vision tokens, our method compresses hundreds of vision tokens into a single VoCo token while minimizing the loss of visual information.
With continual training on time-series sequences of compressed video-frame tokens, VoCo-LLaMA also demonstrates the ability to understand video.
VoCo-LLaMA presents a promising way to unlock the full potential of VLMs' context window.
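The core idea can be illustrated with a minimal attention-mask sketch. This is hypothetical code with made-up names, not the repository's implementation: it only shows the assumption that text tokens are blocked from attending to the raw vision tokens and instead read visual information through the VoCo token(s), which do attend to the vision tokens.

```python
import torch

def build_voco_attention_mask(num_vision: int, num_voco: int, num_text: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for the token layout [vision | VoCo | text]."""
    total = num_vision + num_voco + num_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total)).bool()
    # Block text tokens from attending to the raw vision tokens; they can still
    # attend to the VoCo token(s), which have already attended to the vision tokens.
    text_start = num_vision + num_voco
    mask[text_start:, :num_vision] = False
    return mask

if __name__ == "__main__":
    m = build_voco_attention_mask(num_vision=576, num_voco=1, num_text=32)
    print(m.shape)                   # torch.Size([609, 609])
    print(bool(m[577, :576].any()))  # False: the first text token cannot see vision tokens
    print(bool(m[577, 576]))         # True: it can see the VoCo token
```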
- [2024/06/17] Upload paper and release vision compression code.
- Clone this repository and navigate to VoCo-LLaMA folder
git clone https://github.com/Yxxxb/VoCo-LLaMA.git
cd VoCo-LLaMA
- Install Package
conda create -n voco_llama python=3.10 -y
conda activate voco_llama
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
cp VoCo-LLaMA/llava/model/language_model/cache_py/modeling_attn_mask_utils.py /data/miniconda3/envs/voco_llama/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py
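The destination path in the last command is specific to the authors' environment. If your transformers installation lives elsewhere, a small helper like the one below (an assumption, not part of the repository) locates the installed package and copies the modified file there; run it from the VoCo-LLaMA root:

```python
import os
import shutil

import transformers

# Assumed relative path inside the repository.
src = "llava/model/language_model/cache_py/modeling_attn_mask_utils.py"
# Resolve the directory of the installed transformers package.
dst = os.path.join(os.path.dirname(transformers.__file__), "modeling_attn_mask_utils.py")
shutil.copyfile(src, dst)
print(f"copied {src} -> {dst}")
```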
VoCo-LLaMA training requires only visual instruction tuning. Please download the aligned LLaVA checkpoints (base LLM and projection layers), the annotation file of the LLaVA instruction tuning data llava_v1_5_mix665k.json, and the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in `./playground/data` (a quick layout check is sketched after the tree):
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
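A small sanity check like this (hypothetical, not part of the repository) can confirm that the image folders are in place before launching training:

```python
import os

root = "./playground/data"
expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    path = os.path.join(root, rel)
    print(("ok     " if os.path.isdir(path) else "MISSING"), path)
```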
VoCo-LLaMA is trained on 8 A100 GPUs with 40GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
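For instance, assuming a reference setting of 16 samples per device with no gradient accumulation on 8 GPUs (illustrative numbers only; check scripts/finetune.sh for the actual values), the arithmetic works out as follows:

```python
# Illustrative values only; the actual defaults live in scripts/finetune.sh.
num_gpus = 8
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch_size)  # 128

# The same global batch size on 4 GPUs: halve the devices, double the accumulation.
assert 16 * 2 * 4 == global_batch_size
```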
Train VoCo-LLaMA with visual instruction tuning by running the following command:
bash scripts/finetune.sh
For visual understanding evaluations, we follow the corresponding settings in LLaVA. Please refer to the official LLaVA repository for details on data setup and testing.
If you find this work useful, please consider citing our paper:
@article{ye2024voco,
author={Ye, Xubing and Gan, Yukang and Huang, Xiaoke and Ge, Yixiao and Shan, Ying and Tang, Yansong},
title={{VoCo-LLaMA: Towards Vision Compression with Large Language Models}},
journal={arXiv preprint arXiv:2406.12275},
year={2024},
}