Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By fully exploiting the way LLMs understand vision tokens, our method compresses hundreds of vision tokens into a single VoCo token while minimizing the loss of visual information.
With continual training on time-series sequences of compressed video-frame tokens, VoCo-LLaMA also demonstrates the ability to understand video.
VoCo-LLaMA presents a promising way to unlock the full potential of VLMs' context window.
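The core idea can be illustrated with a minimal attention-mask sketch. This is hypothetical code with made-up names, not the repository's implementation: it only shows the assumption that text tokens are blocked from attending to the raw vision tokens and instead read visual information through the VoCo token(s), which do attend to the vision tokens.

```python
import torch

def build_voco_attention_mask(num_vision: int, num_voco: int, num_text: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for the token layout [vision | VoCo | text]."""
    total = num_vision + num_voco + num_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total)).bool()
    # Block text tokens from attending to the raw vision tokens; they can still
    # attend to the VoCo token(s), which have already attended to the vision tokens.
    text_start = num_vision + num_voco
    mask[text_start:, :num_vision] = False
    return mask

if __name__ == "__main__":
    m = build_voco_attention_mask(num_vision=576, num_voco=1, num_text=32)
    print(m.shape)                   # torch.Size([609, 609])
    print(bool(m[577, :576].any()))  # False: the first text token cannot see vision tokens
    print(bool(m[577, 576]))         # True: it can see the VoCo token
```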
- [2024/06/17] Upload paper and release vision compression code.
- Clone this repository and navigate to VoCo-LLaMA folder
git clone https://github.com/Yxxxb/VoCo-LLaMA.git
cd VoCo-LLaMA
- Install Package
conda create -n voco_llama python=3.10 -y
conda activate voco_llama
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
cp VoCo-LLaMA/llava/model/language_model/cache_py/modeling_attn_mask_utils.py /data/miniconda3/envs/voco_llama/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py
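The destination path in the last command is specific to the authors' environment. If your transformers installation lives elsewhere, a small helper like the one below (an assumption, not part of the repository) locates the installed package and copies the modified file there; run it from the VoCo-LLaMA root:

```python
import os
import shutil

import transformers

# Assumed relative path inside the repository.
src = "llava/model/language_model/cache_py/modeling_attn_mask_utils.py"
# Resolve the directory of the installed transformers package.
dst = os.path.join(os.path.dirname(transformers.__file__), "modeling_attn_mask_utils.py")
shutil.copyfile(src, dst)
print(f"copied {src} -> {dst}")
```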
VoCo-LLaMA training requires only visual instruction tuning. Please download the aligned LLaVA checkpoints (base LLM and projection layers), the annotation file of the LLaVA instruction tuning data llava_v1_5_mix665k.json, and the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in `./playground/data` (a quick layout check is sketched after the tree):
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
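A small sanity check like this (hypothetical, not part of the repository) can confirm that the image folders are in place before launching training:

```python
import os

root = "./playground/data"
expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    path = os.path.join(root, rel)
    print(("ok     " if os.path.isdir(path) else "MISSING"), path)
```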
VoCo-LLaMA is trained on 8 A100 GPUs with 40GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
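For instance, assuming a reference setting of 16 samples per device with no gradient accumulation on 8 GPUs (illustrative numbers only; check scripts/finetune.sh for the actual values), the arithmetic works out as follows:

```python
# Illustrative values only; the actual defaults live in scripts/finetune.sh.
num_gpus = 8
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch_size)  # 128

# The same global batch size on 4 GPUs: halve the devices, double the accumulation.
assert 16 * 2 * 4 == global_batch_size
```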
Train VoCo-LLaMA with visual instruction tuning by running the following command:
bash scripts/finetune.sh
For visual understanding evaluations, we follow the corresponding settings in LLaVA. Please refer to the official LLaVA repository for details on data setup and testing.
If you find this work useful, please consider citing our paper:
@article{ye2024voco,
author={Ye, Xubing and Gan, Yukang and Huang, Xiaoke and Ge, Yixiao and Shan, Ying and Tang, Yansong},
title={{VoCo-LLaMA: Towards Vision Compression with Large Language Models}},
journal={arXiv preprint arXiv:2406.12275},
year={2024},
}