Awesome Evaluation of Visual Generation

This repository collects methods for evaluating visual generation.

Overview

What You'll Find Here

Within this repository, we collect works that aim to answer some critical questions in the field of evaluating visual generation, such as:

Model Evaluation: How does one determine the quality of a specific image or video generation model?
Sample/Content Evaluation: What methods can be used to evaluate the quality of a particular generated image or video?
User Control Consistency Evaluation: How to tell how well the generated images and videos align with the user controls or inputs?

Updates

This repository is updated periodically. If you have suggestions for additional resources, updates on methodologies, or fixes for expiring links, please feel free to do any of the following:

raise an Issue,
nominate awesome related works with Pull Requests,
We are also contactable via email (ZIQI002 at e dot ntu dot edu dot sg).

1. Evaluation Metrics of Generative Models
- 1.1. Evaluation Metrics of Image Generation
- 1.2. Evaluation Metrics of Video Generation
2. Evaluation Metrics of Condition Consistency
- 2.1 Evaluation Metrics of Multi-Modal Condition Consistency
- 2.2. Evaluation Metrics of Image Similarity
3. Evaluation Systems of Generative Models
4. Improving Visual Generation with Evaluation / Feedback / Reward
5. Quality Assessment for AIGC
6. Study and Rethinking
7. Other Useful Resources

1. Evaluation Metrics of Generative Models

1.1. Evaluation Metrics of Image Generation

Metric	Paper	Code
Inception Score (IS)	Improved Techniques for Training GANs (NeurIPS 2016)
Fréchet Inception Distance (FID)	GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017)
Kernel Inception Distance (KID)	Demystifying MMD GANs (ICLR 2018)
CLIP-FID	The Role of ImageNet Classes in Fréchet Inception Distance (ICLR 2023)
Precision-and-Recall	Improved Precision and Recall Metric for Assessing Generative Models (NeurIPS 2019)
Renyi Kernel Entropy (RKE)	An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions (NeurIPS 2023)
CLIP Maximum Mean Discrepancy (CMMD)	Rethinking FID: Towards a Better Evaluation Metric for Image Generation (CVPR 2024)
Kernel-based Entropic Novelty (KEN)	An Interpretable Evaluation of Entropy-based Novelty of Generative Models (2024-02-27)

1.2. Evaluation Metrics of Video Generation

Metric	Paper	Code
FID-vid	GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017)
Fréchet Video Distance (FVD)	Towards Accurate Generative Models of Video: A New Metric & Challenges (arXiv 2018)

2. Evaluation Metrics of Condition Consistency

2.1 Evaluation Metrics of Multi-Modal Condition Consistency

Metric	Condition	Pipeline	Code	References
CLIP Score (`a.k.a.` CLIPSIM)	Text	cosine similarity between the CLIP image and text embeddings	PyTorch Lightning	CLIP Paper (ICML 2021). Metrics first used in CLIPScore Paper (arXiv 2021) and GODIVA Paper (arXiv 2021) applies it in video evaluation.
Mask Accuracy	Segmentation Mask	predict the segmentatio mask, and compute pixel-wise accuracy against the ground-truth segmentation mask	any segmentation method for your setting
DINO Similarity	Image of a Subject (human / object etc)	cosine similarity between the DINO embeddings of the generated image and the condition image		DINO paper. Metric is proposed in DreamBooth.

2.2. Evaluation Metrics of Image Similarity

Metrics	Paper	Code
Learned Perceptual Image Patch Similarity (LPIPS)	The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (2018-01-11) (CVPR 2018)
Structural Similarity Index (SSIM)	Image quality assessment: from error visibility to structural similarity (TIP 2004)
Peak Signal-to-Noise Ratio (PSNR)	-
Multi-Scale Structural Similarity Index (MS-SSIM)	Multiscale structural similarity for image quality assessment (SSC 2004)	PyTorch-Metrics
Feature Similarity Index (FSIM)	FSIM: A Feature Similarity Index for Image Quality Assessment (TIP 2011)

The community has also been using DINO or CLIP features to measure the semantic similarity of two images / frames.

There are also recent works on new methods to measure visual similarity (more will be added):

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data (2023-06-15)

3. Evaluation Systems of Generative Models

3.4. Evaluation of Video Generation

3.4.1. Evaluation of Text-to-Video Generation

Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global Temporal Defect Based Detection Method (2024-05-07)
Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap (2024-04-21)
Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment (2024-03-18)
Sora Generates Videos with Stunning Geometrical Consistency (2024-02-27)
STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models (2024-01-30)
Towards A Better Metric for Text-to-Video Generation (2024-01-15)
VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation (2023-11-03)
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models (2023-10-17)
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization (2023-08-22, NeurIPS 2023)

3.4.2. Evaluation of Image-to-Video Generation

I2V-Bench from ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation (2024-02-06)
AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI (2024-01-03)
VBench-I2V (2024-03) from VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)

3.4.3. Evaluation of Talking Face Generation

Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation (2024-05-07)

3.5. Evaluation of Text-to-Motion Generation

MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization (2024-05-06)

3.6. Evaluation of Model Trustworthiness

3.6.1. Evaluation of Visual-Generation-Model Trustworthiness

Towards Geographic Inclusion in the Evaluation of Text-to-Image Models (2024-05-07)
UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images (2024-05-06)
VBench-Trustworthiness (2024-03) from VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
Holistic Evaluation of Text-To-Image Models (2023-11-07)

3.6.2. Evaluation of Non-Visual-Generation-Model Trustworthiness

Not for visual generation, but related evaluations of other models like LLMs

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (2024-02-06)

4. Improving Visual Generation with Evaluation / Feedback / Reward

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models (2024-05-01)
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning (2024-04-23)
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback (2024-04-11)
UniFL: Improve Stable Diffusion via Unified Feedback Learning (2024-04-08)
ByteEdit: Boost, Comply and Accelerate Generative Image Editing (2024-04-07)
Aligning Diffusion Models by Optimizing Human Utility (2024-04-06)
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (2024-04-04)
VersaT2I: Improving Text-to-Image Models with Versatile Reward (2024-03-27)
Improving Text-to-Image Consistency via Automatic Prompt Optimization (2024-03-26)
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation (2024-03-25)
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation (2024-03-20)
A Dense Reward View on Aligning Text-to-Image Diffusion with Preference (2024-02-13, ICML 2024)
InstructVideo: Instructing Video Diffusion Models with Human Feedback (2023-12-19)
Rich Human Feedback for Text-to-Image Generation (2023-12-15, CVPR 2024)
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback (2023-11-29)
Diffusion Model Alignment Using Direct Preference Optimization (2023-11-21)
Aligning Text-to-Image Diffusion Models with Reward Backpropagation (2023-10-05)
Directly Fine-Tuning Diffusion Models on Differentiable Rewards (2023-09-29)
Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback (2023-07-10, NeurIPS 2023)
DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models (2023-05-25, NeurIPS 2023)
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (2023-04-12)
Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models (2023-04-02, ICLR 2024)
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (2023-03-25)

5. Quality Assessment for AIGC

5.1. Image Quality Assessment for AIGC

5.2. Aesthetic Predictors for Generated Images

Multi-modal Learnable Queries for Image Aesthetics Assessment (2024-05-02, ICME 2024)
Aesthetic Scorer extension for SD Automatic WebUI (2023-01-15)
Simulacra Aesthetic-Models (2022-07-09)
LAION-Aesthetics_Predictor V2: CLIP+MLP Aesthetic Score Predictor (2022-06-26)
LAION-Aesthetics_Predictor V1 (2022-05-21)