VQAScore for Text-to-Image Evaluation [Project Page]
TODO: Pick better teaser images because VQAScore still fails on the current Winoground sample
Install the package via:
git clone https://github.com/linzhiqiu/t2i_metrics
cd t2i_metrics
conda create -n t2i python=3.10 -y
conda activate t2i
conda install pip -y
pip install torch torchvision torchaudio
pip install -e .
Or simply run pip install t2i_metrics (not yet implemented).
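After installing, a quick sanity check is to import the package and list the available VQAScore models (this uses list_all_vqascore_models, which is covered later in this README):
import t2i_metrics
print(t2i_metrics.list_all_vqascore_models())  # should print the supported VQAScore model names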
The following Python code is all you need to evaluate the similarity between an image and a text (higher scores mean the two are semantically closer).
import t2i_metrics
clip_flant5_score = t2i_metrics.VQAScore(model='clip-flant5-xxl') # our best scoring model
# For a single (image, text) pair
image = "images/test0.jpg" # an image path in string format
text = "a young person kisses an old person"
score = clip_flant5_score(images=[image], texts=[text])
# Alternatively, if you want to calculate the pairwise similarity scores
# between M images and N texts, run the following to return a M x N score tensor.
images = ["images/test0.jpg", "images/test1.jpg"]
texts = ["an old person kisses a young person", "a young person kisses an old person"]
scores = clip_flant5_score(images=images, texts=texts) # scores[i][j] is the score between image i and text j
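For instance, assuming scores is returned as an M x N torch tensor as described above, you can recover the best-matching text for each image with a small illustrative snippet (not part of the library):
best_text_idx = scores.argmax(dim=1)  # index of the highest-scoring text for each image
for i, j in enumerate(best_text_idx.tolist()):
    print(f"{images[i]} best matches: {texts[j]}")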
- GPU usage: The above scripts will by default use the first CUDA device on your machine. We recommend a 40GB GPU for the largest VQA models such as clip-flant5-xxl and llava-v1.5-13b. If you have limited GPU memory, consider using smaller models such as clip-flant5-xl and llava-v1.5-7b (see the example after this list).
- Cache directory: You can change the cache folder (default is ./hf_cache/) by updating HF_CACHE_DIR in t2i_metrics/constants.py.
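For example, if GPU memory is tight, you can instantiate one of the smaller models instead; only the model name changes relative to the earlier snippet:
import t2i_metrics
clip_flant5_score = t2i_metrics.VQAScore(model='clip-flant5-xl')  # smaller CLIP-FlanT5 variant
llava_score = t2i_metrics.VQAScore(model='llava-v1.5-7b')         # smaller LLaVA-1.5 variant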
If you have a large dataset of M images x N texts, then you can optionally speed up inference using the following batch processing script.
import t2i_metrics
clip_flant5_score = t2i_metrics.VQAScore(model='clip-flant5-xxl')
# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
{'images': ["images/sdxl_0.jpg", "images/dalle3_0.jpg", "images/deepfloyd_0.jpg", "images/imagen2_0.jpg"], 'texts': ["an old person kisses a young person"]},
{'images': ["images/sdxl_1.jpg", "images/dalle3_1.jpg", "images/deepfloyd_1.jpg", "images/imagen2_1.jpg"], 'texts': ["a young person kissing an old person"]},
#...
]
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16) # (n_sample, 4, 1) tensor
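Because the returned tensor has shape (n_sample, 4, 1) in this example, you can, for instance, pick the highest-scoring generation for each prompt (an illustrative snippet assuming scores is a torch tensor):
scores = scores.squeeze(-1)                    # (n_sample, 4)
best_image_idx = scores.argmax(dim=1).tolist()  # index of the best of the 4 images per text
for sample, idx in zip(dataset, best_image_idx):
    print(f"Best image for '{sample['texts'][0]}': {sample['images'][idx]}")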
For VQAScore, the question and answer templates can affect the final performance. We provide a simple default template for each model. For example, CLIP-FlanT5 and LLaVA-1.5 use the template below, which can be found in t2i_metrics/models/vqascore_models/clip_t5_model.py (we ignore the prepended system message for simplicity):
# {} will be replaced by the caption
default_question_template = "Is the image showing '{}'? Please answer yes or no."
default_answer_template = "Yes"
You can specify your own template by passing question_template and answer_template to the forward() or batch_forward() function:
# An alternative template for VQAScore
question_template = "Does the image show '{}'? Please answer yes or no."
answer_template = "Yes"
scores = clip_flant5_score(images=images,
texts=texts,
question_template=question_template,
answer_template=answer_template)
You can also compute P(caption | image) (VisualGPTScore) instead of P(answer | image, question):
vgpt_question_template = "" # no question
vgpt_answer_template = "{}" # {} will be replaced by the caption, so this computes P(caption | image)
scores = clip_flant5_score(images=images,
texts=texts,
question_template=vgpt_question_template,
answer_template=vgpt_answer_template)
We currently support CLIP-FlanT5, LLaVA-1.5, and InstructBLIP for VQAScore. We also support CLIPScore using CLIP, and ITMScore using BLIPv2:
llava_score = t2i_metrics.VQAScore(model='llava-v1.5-13b') # LLaVA-1.5 is the second best
clip_score = t2i_metrics.CLIPScore(model='openai:ViT-L-14-336')
blip_itm_score = t2i_metrics.ITMScore(model='blip2-itm')
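Assuming these score objects are called the same way as VQAScore above (with images= and texts= lists), a usage sketch looks like:
images = ["images/test0.jpg", "images/test1.jpg"]
texts = ["an old person kisses a young person", "a young person kisses an old person"]
llava_scores = llava_score(images=images, texts=texts)   # M x N VQAScore matrix from LLaVA-1.5
clip_scores = clip_score(images=images, texts=texts)     # M x N CLIPScore matrix
itm_scores = blip_itm_score(images=images, texts=texts)  # M x N ITMScore matrix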
You can check all supported models by running the below commands:
print("VQAScore models:")
print(t2i_metrics.list_all_vqascore_models())
print()
print("ITMScore models:")
print(t2i_metrics.list_all_itmscore_models())
print()
print("CLIPScore models:")
print(t2i_metrics.list_all_clipscore_models())
You can easily test on these vision-language benchmarks by running:
python eval.py --model clip-flant5-xxl
python eval.py --model llava-v1.5-13b
python eval.py --model blip2-itm
python eval.py --model openai:ViT-L-14
# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question "Question: Is the image showing '{}'?" --answer "Yes"
You can easily implement your own scoring metric. For example, if you have a stronger VQA model, you can include it in t2i_metrics/models/vqascore_models. Please check out our implementations of LLaVA-1.5 and InstructBLIP as a starting point.
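As a purely hypothetical illustration of what such a wrapper tends to need (the class and method names below are placeholders, not the library's actual interface; follow the LLaVA-1.5 and InstructBLIP files for the real base class and method signatures):
# Hypothetical skeleton -- names are placeholders, not the library's API.
class MyVQAModel:
    def __init__(self, device='cuda'):
        self.device = device
        # load your VQA model and tokenizer here

    def forward(self, images, texts, question_template, answer_template):
        # format questions/answers from the templates, run the model,
        # and return P(answer | image, question) for each (image, text) pair
        raise NotImplementedError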
This repository is inspired by the Perceptual Metric (LPIPS) repository by Richard Zhang for automatic evaluation of image-to-image similarity.