Evaluation of Deep Generative Models

This repository contains the codebase for evaluating deep generative models, as presented in the paper "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models".

We studied 41 generative models across a diverse range of image datasets and found:

The state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics when using the default Inception-V3 network.
Supervised networks do not provide a perceptual space that generalizes well for image evaluation, and neither do self-supervised methods from particular families.
DINOv2 provides such a generalized representation space and allows for much richer evaluation of generative models. Researchers should replace Inception-V3 in all evaluation metrics. We provide an extensive DINOv2 leaderboard below and have added the results to paperswithcode.com.
Generative models directly memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that currently proposed diagnostic metrics do not properly detect memorization.

Here we provide code to compute the following 15 generative evaluation metrics using 8 different encoder networks:

Metrics:

Fréchet Distance: FD
FD_∞
Spatial FID: sFID
Kernel Distance
Inception Score
FLS
Precision & Recall
Density & Coverage
Vendi score
AuthPct
C_T score
FLS-POG
Realism
Approximate Sliced Wasserstein: ASW
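
For orientation, the Fréchet distance at the top of this list is the closed-form distance between two Gaussians fitted to encoder features: FD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The following is a minimal NumPy sketch of that formula, not the implementation used in this repository; the function name and the eigenvalue-based matrix square root are choices made for this illustration.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (n, d) feature arrays.

    Illustrative sketch only; the name and signature are hypothetical.
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clip tiny negative round-off values.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Usage with random stand-in features; real features would come from an encoder.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
shifted = real + 3.0  # same covariance, mean shifted by 3 in each of 4 dims
print(frechet_distance(real, shifted))
```

Because the two sets in the usage example differ only by a constant shift, the covariance terms cancel and the result reduces to the squared distance between the means.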

Encoders:


Our multifaceted investigation of generative evaluation shows that diffusion models are unfairly punished by the Inception network: they synthesize more realistic images as judged by humans and their diversity more closely resembles the training data, yet are consistently ranked worse than GANs on metrics computed in Inception-V3 representation space.

Installation & Usage

Installation

First clone this repository, then navigate to the directory and run pip install to install all required packages.

git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .

We recommend you do this in a conda environment:

conda create --name dgm-eval pip python==3.10
conda activate dgm-eval
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .

Usage

Computing metrics only requires the paths to either locally hosted image datasets or torchvision.datasets. Encoders are automatically downloaded. For example, the following will compute the Fréchet distance (fd), kernel distance (kd), precision/recall/density/coverage (prdc), and the C_T score (ct) using DINOv2 (default) as the encoder.

python -m dgm_eval path/to/training_dataset path/to/generated_dataset \
				--test_path path/to/test_dataset \
				--model dinov2 \
				--metrics fd kd prdc ct

See scripts/run_experiments.sh or run python -m dgm_eval -h for further details on command-line parameters. As we suggest in the paper, final leaderboard values should be reported using the default model size (DINOv2-ViT-L/14), but tracking progress during training is a factor of 4 more efficient with DINOv2-ViT-B/14. To use this architecture instead, simply add --arch vitb14 as a command-line parameter.
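
When evaluation is scripted, the two regimes described above (the default encoder size for final numbers, --arch vitb14 while tracking training) can be selected with a small helper. This is a sketch that only assembles the command-line flags already shown in this README; the function name and its fast argument are hypothetical.

```python
import sys


def dgm_eval_command(train_path, gen_path, metrics=("fd",), fast=False, test_path=None):
    """Build an argument list for `python -m dgm_eval` (hypothetical helper).

    fast=True adds `--arch vitb14`; otherwise the default encoder size is used.
    """
    cmd = [sys.executable, "-m", "dgm_eval", train_path, gen_path]
    if test_path is not None:
        cmd += ["--test_path", test_path]
    cmd += ["--model", "dinov2", "--metrics", *metrics]
    if fast:
        cmd += ["--arch", "vitb14"]
    return cmd


# During training: cheaper encoder. For final reporting: call with fast=False.
cmd = dgm_eval_command("path/to/training_dataset", "path/to/generated_dataset",
                       metrics=("fd", "kd"), fast=True)
print(" ".join(cmd[1:]))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```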

Local datasets should either be un-conditional:

local/path/
	000000.png
	000001.png
	...

or conditional:

local/path/
	0/
		000000.png
		000001.png
		...
	1/
		000000.png
		000001.png
		...
	...
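
Before a long run, it can be useful to check that a local folder matches one of the two layouts above. The sketch below is not part of dgm-eval; the function name and the set of accepted image suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: which file suffixes count as images; adjust to your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def dataset_layout(root):
    """Return "unconditional" or "conditional"; raise ValueError otherwise."""
    root = Path(root)
    entries = sorted(root.iterdir())
    dirs = [p for p in entries if p.is_dir()]
    files = [p for p in entries if p.is_file()]

    if dirs and files:
        raise ValueError(f"{root} mixes image files and class folders")

    # Unconditional: images directly under root.
    # Conditional: one sub-folder per class, images inside each.
    to_check = files if files else [f for d in dirs for f in sorted(d.iterdir())]
    if not to_check:
        raise ValueError(f"{root} contains no images")

    bad = [p for p in to_check
           if not p.is_file() or p.suffix.lower() not in IMAGE_SUFFIXES]
    if bad:
        raise ValueError(f"unexpected entries, e.g. {bad[0]}")

    return "unconditional" if files else "conditional"
```

A call such as dataset_layout("local/path") would then return the detected layout or raise on the first unexpected entry.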

The directory should only include image files. To download and use a dataset from torchvision.datasets, just specify the dataset and train/test string:

python -m dgm_eval CIFAR10:train CIFAR10:test

A full example is as follows:

python -m dgm_eval CIFAR10:train CIFAR10:test \
					--model dinov2 \
					--metrics fd kd prdc \
					--device cuda \
					--batch_size 256 \
					--nsample 512

>>> ....
>>> Num real: 512 Num fake: 512
>>> fd: 862.53745
>>> kd_value: 0.01095
>>> kd_variance: 0.00000
>>> precision: 0.90430
>>> recall: 0.91797
>>> density: 0.97969
>>> coverage: 0.94141
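
To log these values or compare them across runs, the printed lines can be collected into a dictionary. The sketch below assumes the output consists of "name: value" lines as shown above, optionally prefixed with ">>> "; the function name is hypothetical, and the exact output format of the tool may differ.

```python
import re

# One metric per line: an identifier, a colon, and a single number.
# Lines such as "Num real: 512 Num fake: 512" do not match and are skipped.
_METRIC_LINE = re.compile(
    r"^(?:>>>\s*)?([A-Za-z_]\w*):\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(text):
    """Map metric names to floats for every 'name: value' line in text."""
    metrics = {}
    for line in text.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


# Usage with a few of the lines shown above.
example_output = """>>> Num real: 512 Num fake: 512
>>> fd: 862.53745
>>> kd_value: 0.01095
>>> precision: 0.90430
"""
print(parse_metrics(example_output))
```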

Data Access

Images

All generated data shown in this work can be accessed at the following link:

drive.google.com/drive/folders/1X0MFaUta90d3zF9xG4KchjR-8SE0cT_7?usp=sharing

Including:

Datasets of 100,000 image samples from 41 generative models across CIFAR10/, imagenet256/, LSUN Bedroom/, and FFHQ256/.
Training & test data at 256 x 256 resolution
Generated datasets for controlled experiments presented in the Appendix can be found in toy-datasets/

Human Evaluation

Data for human evaluation of image realism can be found at data/human-evaluation-realism/

DINOv2 Leaderboard


DINOv2 is the best suited model for generative evaluation. Our extensive quantitative and qualitative assessments showed that it distills a generalized representation space suitable for evaluation of diverse image datasets. Metrics computed in DINOv2 space show much better alignment with human evaluation than those in Inception-V3 space.

We have included leaderboard values on paperswithcode (links), and list all metrics in a table below:

Visualizing Heatmaps

Heatmaps can be visualized for each model on any given image dataset with the following command; examples are shown below:

python -m dgm_eval CIFAR10:train CIFAR10:test \
					 --model inception \
					 --metrics fd \
					 --device cuda \
					 --batch_size 256 \
					 --nsample 50000 \
					 --heatmaps

Images	Inception	DINOv2

Citing

If you use any part of this repository in your research, please cite the associated paper with the following BibTeX entry:

Authors: George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem

@misc{stein2023exposing,
      title={Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models}, 
      author={George Stein and Jesse C. Cresswell and Rasa Hosseinzadeh and Yi Sui and Brendan Leigh Ross and Valentin Villecroze and Zhaoyan Liu and Anthony L. Caterini and J. Eric T. Taylor and Gabriel Loaiza-Ganem},
      year={2023},
      eprint={2306.04675},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

The data and code are licensed under the MIT License, copyright by Layer 6 AI.
