Evaluating text-to-image generation using VQAScore with CLIP-FlanT5! This codebase contains the training code for CLIP-FlanT5.
[Project Page] [Code for evaluation] [Data] [Model Zoo]
Evaluating Text-to-Visual Generation with Image-to-Text Generation (Arxiv) [Paper]
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang*, Deva Ramanan*
- [06/13/2024] 🔥 We released training code for CLIP-FlanT5 for automated text-to-image evaluation!
Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, FlanT5, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
We follow LLaVA-1.5 for installation. If you already installed the environment from LLaVA, there is no need to create a new environment.
- Clone this repository and navigate to CLIP-FlanT5 folder
git clone https://github.com/linzhiqiu/CLIP-FlanT5.git
cd CLIP-FlanT5
- Install Package (if you do not have llava environment installed already)
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases (if you do not have llava environment installed already)
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install huggingface_hub
python -m pip install huggingface_hub
- Upgrade to the latest code base
git pull
pip install -e .
Please check out our Model Zoo for public CLIP-FlanT5 checkpoints.
Training CLIP-FlanT5 consists of two stages: (1) feature alignment stage: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen FlanT5; (2) VQA finetuning stage: use 150K LLaVA-chat data and around 515K VQA data from academic-oriented tasks.
CLIP-FlanT5 is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
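As a quick sanity check of the batch-size rule above, here is a minimal Python sketch that computes the gradient_accumulation_steps needed to preserve a given global batch size. The helper name and the per-device batch sizes in the example are hypothetical and are not taken from the training scripts; only the global batch sizes (256 and 96) come from the tables in this README.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                per_device_batch_size: int,
                                num_gpus: int) -> int:
    """Return the accumulation steps that keep the global batch size fixed.

    global = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
    """
    samples_per_step = per_device_batch_size * num_gpus
    if samples_per_step <= 0:
        raise ValueError("per-device batch size and GPU count must be positive")
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"{per_device_batch_size} x {num_gpus}; pick a different per-device size"
        )
    return global_batch_size // samples_per_step


if __name__ == "__main__":
    # Hypothetical setups on 4 GPUs (per-device sizes are illustrative only).
    print(gradient_accumulation_steps(256, 16, 4))  # pretraining global batch 256
    print(gradient_accumulation_steps(96, 8, 4))    # finetuning global batch 96
```

If the function raises, the chosen per-device batch size cannot reproduce the global batch size exactly on that number of GPUs, so a different per-device value is needed.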
We use a similar set of hyperparameters as LLaVA-1.5 in finetuning. The hyperparameters used in both pretraining and finetuning are provided below.
- Pretraining
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
CLIP-FlanT5 | 256 | 1e-2 | 1 | 2048 | 0 |
- Finetuning
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
CLIP-FlanT5 | 96 | 2e-5 | 1 | 2048 | 0 |
The base model FlanT5, which is a strong QA model developed by Google, will be downloaded automatically when you run our provided training scripts. No action is needed.
You can download the 558K subset of the LAION-CC-SBU dataset with BLIP captions used in the LLaVA-1.5 paper here and unzip/put them under "playground/data/LLaVA-Pretrain". The final folder structure should look like:
playground/data/
└── LLaVA-Pretrain
    ├── blip_laion_cc_sbu_558k.json
    └── images
Pretraining takes around 5 hours for CLIP-FlanT5-XXL on 8x A100 (80G) at an image resolution of 336px. It takes around 2 hours for LLaVA-v1.5-7B.
Training script with DeepSpeed ZeRO-2: clip-flant5-xxl-stage-1.sh.
- --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
- --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
If you are using a Slurm environment, you can also use the Slurm script provided in clip-flant5-xxl-stage-1.slurm. Change the default partition name to your own first: #SBATCH --partition={your_own_partition}
- Prepare data
We flattened the LLaVA-1.5 data mixture. Please download llava_v1_5_mix665k_flattened_multi_turn.json, and also download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script. Because LLaVA-1.5 uses .jpg format, please run fix_ocrvqa.py after downloading and unzipping.
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in ./playground/data:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
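Before launching stage-2 training, it can help to confirm that the image folders match the layout above. The following is a minimal sketch, not part of this repository: the helper name missing_image_dirs is hypothetical, and it only checks that the listed directories exist, not that their contents are complete.

```python
from pathlib import Path

# Image directories expected under ./playground/data, as listed above.
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_image_dirs(root) -> list[str]:
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs("./playground/data")
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("All expected image directories are present.")
```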
- Start training!
You may download the stage-1 pretrained projectors in Model Zoo.
Stage-2 VQA training takes around 80 hours for CLIP-FlanT5-XXL on 8x A100 (80G), due to the increased resolution to 336px and flattening the multi-turn conversations into single-turn. It takes around 60 hours for CLIP-FlanT5-XL on 8x A100 (40G).
Training script with DeepSpeed ZeRO-3: clip-flant5-xxl.sh. Optionally, if you use Slurm, you may use clip-flant5-xxl.slurm (make sure to change the default Slurm partition).
New options to note:
- --version: the prompt template / conversational style to be used. In the original LLaVA-1.5, stage-1 training uses "plain" and stage-2 training uses "v1". We modify these for FlanT5's encoder-decoder architecture but keep the same system message and roles. In addition, we find that split-text training (BLIP2-FlanT5 reference) works better for stage-1 training of CLIP-FlanT5. We recommend using "t5_plain_split_text" for stage-1 training, and "t5_v1" for stage-2 training.
- --mm_projector_type mlp2x_gelu: we use the two-layer MLP vision-language connector following LLaVA-1.5.
- --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
- --image_aspect_ratio pad: this pads non-square images to square instead of cropping them; LLaVA-1.5 uses it to slightly reduce hallucination.
- --group_by_modality_length True: LLaVA-1.5 uses this because its instruction tuning dataset contains both language data (e.g. ShareGPT) and multimodal data (e.g. LLaVA-Instruct). It makes the training sampler draw only a single modality (either image or language) per batch, which we observe to speed up training by ~25% without affecting the final outcome.
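To make the pad option more concrete, here is a small sketch of the geometry involved when a non-square image is padded to a square rather than cropped. It assumes the image is centered on the square canvas; the function name is hypothetical, and the actual implementation in the codebase (including the fill color used for the padded area) may differ.

```python
def pad_to_square_box(width: int, height: int) -> tuple[int, int, int]:
    """Return (side, left, top) for centering a width x height image
    on a square canvas whose side equals the longer edge.

    No pixels are discarded: the shorter dimension is padded equally
    on both sides (up to one pixel of rounding).
    """
    side = max(width, height)
    left = (side - width) // 2   # horizontal offset of the original image
    top = (side - height) // 2   # vertical offset of the original image
    return side, left, top


if __name__ == "__main__":
    # A 640x480 landscape image becomes a 640x640 canvas,
    # with the original placed 80 pixels from the top.
    print(pad_to_square_box(640, 480))
```

Cropping the same 640x480 image to a square would instead discard 160 columns of pixels, which is the information that padding preserves.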
Please refer to the t2v_metrics repo which contains evaluation code for VQAScore using CLIP-FlanT5.
If you find this work useful for your research and applications, please cite it using this BibTeX:
@article{lin2024evaluating,
title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
journal={arXiv preprint arXiv:2404.01291},
year={2024}
}
- LLaVA-1.5: the codebase we built upon.