TL;DR: A Python library providing utilities to evaluate text-to-image (T2I) models. It complements established benchmarks like HEIM by focusing on metrics that help developers fine-tune T2I models for a particular style or concept.
Running it is as easy as:
pip install image-eval
image_eval -g <generated-dataset> -r <reference-dataset> -p <prompts-file> -m all
See the installation section for more detailed instructions.
Image quality is subjective, and evaluating text-to-image (T2I) models is hard. But we can't make progress without being able to measure progress. We need standardized and robust tooling.
Training T2I models is particularly difficult because there are no metrics to inform you whether your model is converging. For instance, this is what a typical training loss looks like:
The goal of this repo is to bring back measurability and help you make informed decisions when building T2I models. For instance, we discovered that using CMMD as a validation metric during training on as little as 50 images can help you gauge how much progress your model is making. The plot shows the distance between a reference set and the generated set for various checkpoints:
Read more about our discoveries in the Tips and Tricks section, which we will update as we learn more.
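For intuition about what a distance between a reference set and a generated set means, here is a minimal sketch of the squared maximum mean discrepancy (MMD) with a Gaussian kernel, the statistic that CMMD builds on. It is not this library's implementation: the function names, the bandwidth sigma, and the random vectors standing in for image embeddings are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clamp tiny negative values caused by floating-point error.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def mmd_squared(reference, generated, sigma=10.0):
    """Biased estimate of the squared MMD between two sets of embeddings."""
    k_rr = gaussian_kernel(reference, reference, sigma).mean()
    k_gg = gaussian_kernel(generated, generated, sigma).mean()
    k_rg = gaussian_kernel(reference, generated, sigma).mean()
    return k_rr + k_gg - 2.0 * k_rg


# Random vectors stand in for embeddings of 50 images per set (an assumption;
# a real run would embed actual images with an encoder such as CLIP).
rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 16))
early_checkpoint = rng.normal(loc=3.0, size=(50, 16))  # far from the reference set
late_checkpoint = rng.normal(loc=0.1, size=(50, 16))   # close to the reference set

print("early:", mmd_squared(reference, early_checkpoint))
print("late: ", mmd_squared(reference, late_checkpoint))
```

In this toy setup the checkpoint whose embeddings sit closer to the reference set gets the smaller distance, which is the behavior you would look for when tracking checkpoints during training.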
We use the following categories for our metrics, inspired by LyCORIS:
- Fidelity: the extent to which generated images adhere to the target concept.
- Pairwise similarity: the similarity between two individual images; in contrast, fidelity makes bulk comparisons between two datasets.
- Controllability: the model’s ability to generate images that align well with text prompts.
- Diversity: the variety of images that are produced from a single or a set of prompts.
- Image quality: the visual appeal of the generated images (naturalness, absence of artifacts or deformations).
We dared to list these aspects in the order in which we believe they can be meaningfully measured. Measuring the fidelity of one dataset to another is a much better-defined problem than measuring a universal and elusive image quality.
Here are the metrics we currently support:
Metric name | Category | Source |
---|---|---|
centroid_similarity | fidelity | ours |
cmmd | fidelity | paper |
lpips | pairwise similarity | paper |
multi_ssim | pairwise similarity | paper |
psnr | pairwise similarity | paper |
uiqui | pairwise similarity | paper |
clip_score | controllability | paper |
image_reward | controllability | paper |
human_preference_score | controllability | repo |
vendi_score | diversity | paper |
fid | image quality | paper |
inception_score | image quality | paper |
aesthetic_predictor | image quality | repo |
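To make the pairwise similarity category concrete, here is a minimal sketch of PSNR (peak signal-to-noise ratio) using its standard definition, 10 * log10(MAX^2 / MSE). This is an illustration only, not the code behind the psnr metric in this library; the function name and the toy images are assumptions.

```python
import numpy as np


def psnr(image_a, image_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    a = np.asarray(image_a, dtype=np.float64)
    b = np.asarray(image_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


# Two toy 8x8 RGB images that differ by exactly one intensity level everywhere.
base = np.full((8, 8, 3), 100, dtype=np.uint8)
shifted = np.full((8, 8, 3), 101, dtype=np.uint8)

print(psnr(base, shifted))  # higher means more similar
print(psnr(base, base))     # identical images give infinity
```

Note that PSNR compares one image against one other image, whereas fidelity metrics such as cmmd compare whole datasets.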
Some of the metrics above rely on image embeddings in a modular way. Even though they were originally published using CLIP embeddings, we noticed that swapping the encoder can yield more useful metrics in certain cases. You can mix and match the metrics above with the following encoders:
- CLIP is by far the most popular encoder. It was used by the original Stable Diffusion model.
- DINOv2. Compared to CLIP, which used text-guided pretraining (aligning images against captions), DINOv2 used self-supervised learning on images alone. Its training objective maximizes agreement between different patches within the same image. It was trained on a dataset of 142M automatically curated images.
- ConvNeXt V2. Like DINOv2, ConvNeXt V2 did not use text-guided pretraining. It was trained on an image dataset to recover masked patches. In contrast to DINOv2 (which uses a ViT), ConvNeXt V2 uses a convolutional architecture. ConvNeXt V2 is the successor of MAE (Masked Auto Encoder) embeddings.
- InsightFace is particularly good at encoding human faces, and is very effective when fine-tuning T2I models for headshots.
We also provide a simple, ready-to-use Streamlit interface for performing human evaluation of model outputs on your local machine. We recommend using it when the automated metrics are not discriminative enough to help you decide between two checkpoints.
This library has been tested on Python 3.10.9. To install:
pip install image-eval
Optionally, if you have a CUDA-enabled device, install the version of PyTorch that matches your CUDA version. For CUDA 11.3, that might look like:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
There are two ways to interact with the image-eval library: either through the CLI or through the API.
Once you have installed the library, you can invoke it through the CLI on your terminal via image_eval <flags>. The full list of flags is in eval.py, but here are the most important ones:
- -g should point to a folder of generated images
- -r should point to a folder of reference images
- -p (needed for controllability metrics only) should point to a .json file that stores image_filename: prompt pairs, for instance:
  {
    "image_1.jpg": "prompt for image 1",
    "image_2.jpg": "prompt for image 2",
    ...
  }
- -m should specify the desired metrics; it can be all, a certain category (e.g. fidelity) or a specific metric (e.g. centroid_similarity).
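If your prompts live in code rather than in a file, the prompts file for -p can be produced with a few lines of Python. The sketch below is one way to do it; the helper name write_prompts_file and the temporary output location are hypothetical and not part of the library.

```python
import json
import tempfile
from pathlib import Path


def write_prompts_file(prompts_by_filename, output_path):
    """Write {image_filename: prompt} pairs as JSON and return the output path."""
    output_path = Path(output_path)
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(prompts_by_filename, f, indent=2)
    return output_path


# Keys are image file names inside the generated-images folder (assumed layout).
prompts = {
    "image_1.jpg": "prompt for image 1",
    "image_2.jpg": "prompt for image 2",
}

# Written to the system temp directory here; use any location you like.
prompts_path = write_prompts_file(prompts, Path(tempfile.gettempdir()) / "prompts.json")
print(prompts_path.read_text(encoding="utf-8"))
```

The resulting path is what you would pass to -p.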
For example, to calculate the fidelity of a generated dataset to some reference images, you would run
image_eval -m fidelity -g /path/to/generated/images -r /path/to/reference/images
The result will look like this:
| Metric Name | Value |
|---------------------------------+----------|
| centroid_similarity_clip | 0.844501 |
| centroid_similarity_dino_v2 | 0.573843 |
| centroid_similarity_convnext_v2 | 0.606375 |
| centroid_similarity_insightface | 0.488649 |
| cmmd_clip | 0.162164 |
| cmmd_dino_v2 | 0.1689 |
| cmmd_convnext_v2 | 0.187492 |
| cmmd_insightface | 0.169578 |
To launch the human evaluation interface, run:
image_eval --local-human-eval --model-predictions-json /path/to/model_comparisons.json
Here model_comparisons.json is a JSON file with the following format:
[
{
"model_1": "path to image 1 from model 1",
"model_2": "path to image 1 from model 2",
"prompt": "prompt for image 1"
},
{
"model_1": "path to image 2 from model 1",
"model_2": "path to image 2 from model 2",
"prompt": "prompt for image 2"
},
...
]
where model_1 and model_2 are the keys for the paths to image outputs for the respective models. Our library does expect the keys to match these values exactly.
An interface should launch in your browser at http://localhost:8501.
NOTE: When you click Compute Model Wins, a local file named scores.json will be created in the directory from which you launched the CLI.
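The model_comparisons.json format described above can be assembled programmatically. The following is a minimal sketch that assumes both models wrote their outputs under the same file names in two separate folders; the helper name build_comparisons and the folder names are hypothetical.

```python
import json
from pathlib import Path


def build_comparisons(model_1_dir, model_2_dir, prompts_by_filename):
    """Create one comparison entry per prompt, pairing same-named images."""
    entries = []
    for filename, prompt in prompts_by_filename.items():
        entries.append(
            {
                # The keys must be exactly "model_1" and "model_2".
                "model_1": str(Path(model_1_dir) / filename),
                "model_2": str(Path(model_2_dir) / filename),
                "prompt": prompt,
            }
        )
    return entries


prompts = {
    "image_1.jpg": "prompt for image 1",
    "image_2.jpg": "prompt for image 2",
}
comparisons = build_comparisons("outputs/model_a", "outputs/model_b", prompts)

# Serialize in the list-of-objects shape expected by --model-predictions-json.
print(json.dumps(comparisons, indent=2))
```

Save the printed JSON to a file and pass its path to --model-predictions-json.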
You can also interact with the library through the API directly. For example, to invoke the clip_score metric, you could do the following:
import numpy as np

from image_eval.evaluators import CLIPScoreEvaluator

evaluator = CLIPScoreEvaluator(device="cpu")  # or "cuda" if you have a GPU-enabled device
# List of 10 random 224x224 RGB images with uint8 pixel values
images = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(10)]
prompts = ["random prompt"] * 10  # one prompt per image
evaluator.evaluate(images, prompts)
In this section, we will share our tips on how to use existing metrics to bring some rigor to the art of fine-tuning T2I models. Please don't take anything as an absolute truth. If we knew things for sure, we would be writing a paper instead of a GitHub README. Suggestions and constructive feedback are more than welcome!
[TODO]
We welcome any and all contributions to this library, as well as discussions on how we can make the art of training T2I models more scientific.
To add a new metric, all you need to do is create a new class that inherits from BaseEvaluator and implements the evaluate method. For examples of how our current metrics implement this contract, see evaluators.py.
To add a new encoder, simply implement the BaseEncoder interface (see encoders.py).
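As a rough illustration of the metric contract described above, here is a self-contained sketch. The BaseEvaluator defined in it is a local stand-in, not the library's real class: the actual constructor and evaluate signature may differ, so check evaluators.py before copying this pattern. MeanBrightnessEvaluator is a hypothetical toy metric.

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseEvaluator(ABC):
    """Local stand-in for the library's base class (assumed shape, not the real one)."""

    @abstractmethod
    def evaluate(self, generated_images, *args, **kwargs):
        """Return a metric value for the given images."""


class MeanBrightnessEvaluator(BaseEvaluator):
    """Hypothetical toy metric: average pixel intensity scaled to [0, 1]."""

    def evaluate(self, generated_images, *args, **kwargs):
        per_image = [
            np.asarray(image, dtype=np.float64).mean() / 255.0
            for image in generated_images
        ]
        return float(np.mean(per_image))


# One fully white and one fully black image.
images = [
    np.full((4, 4, 3), 255, dtype=np.uint8),
    np.zeros((4, 4, 3), dtype=np.uint8),
]
score = MeanBrightnessEvaluator().evaluate(images)
print(score)
```

When contributing a real metric, you would import BaseEvaluator from the library instead of defining it yourself.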
Here are other notable resources for evaluating T2I models: