I²EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, Rongrong Ji
- [2024.9.25] Accepted by NeurIPS 2024
- [2024.12.4] Released the multi-round editing evaluation
Overview of I²EBench, an automated system for evaluating the quality of editing results generated by instruction-based image editing (IIE) models. We collected over 2,000 images from public datasets and annotated them with their original editing instructions. To diversify the instructions, we used ChatGPT to generate varied versions. With the collected images and the original/diverse editing instructions, we ran existing IIE models to generate edited images. We then developed an evaluation methodology that automatically assesses how well the edited images adhere to the provided instructions along different dimensions. We also conducted a human evaluation to obtain human preferences over the editing results of the different IIE models. Finally, we analyzed the correlation between the automated evaluation and the human evaluation, confirming alignment with human perception.
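As a minimal illustration of that last step, the sketch below computes a Spearman rank correlation between automated and human per-model scores. The score lists are made-up placeholders, not benchmark numbers, and the paper's actual analysis may differ.

```python
# Minimal sketch of checking agreement between automated and human scores.
# The two lists are hypothetical per-model scores, NOT real benchmark data.
from scipy.stats import spearmanr

automated_scores = [0.71, 0.58, 0.64, 0.49, 0.77, 0.55, 0.69, 0.61]
human_scores     = [0.68, 0.60, 0.61, 0.45, 0.80, 0.52, 0.72, 0.59]

rho, p_value = spearmanr(automated_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")  # high rho => alignment
```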
- Comparison of radar charts for I²EBench scores in different dimensions using (a) original instructions and (b) diverse instructions.
- Comparison of radar charts for I²EBench scores in different categories using (a) original instructions and (b) diverse instructions. The scores of all dimensions are normalized and averaged.
.
├── EditBench
├── EditData                 ### dataset with different dimensions
│   ├── BGReplacement
│   ├── ColorAlteration
│   ├── Counting
│   ├── Deblurring
│   ├── DirectionPerception
│   ├── HazeRemoval
│   ├── Lowlight
│   ├── NoiseRemoval
│   ├── ObjectRemoval
│   ├── RainRemoval
│   ├── RegionAccuracy
│   ├── Replacement
│   ├── ShadowRemoval
│   ├── SnowRemoval
│   ├── StyleAlteration
│   └── WatermarkRemoval
├── EditEval                 ### 1. with `diverse` editing instructions
                             ### 2. evaluation results of 8 editing models (any2pix,
                             ###    hive, hqedit, iedit, instruct-diffusion, instructpix2pix,
                             ###    magicbrush, mgie) in every dimension
│   ├── BGReplacement        # evaluation results of the 8 editing models
│   ├── ColorAlteration      # evaluation results of the 8 editing models
│   ├── Counting             # ...
│   ├── Deblurring
│   ├── DirectionPerception
│   ├── HazeRemoval
│   ├── Lowlight
│   ├── NoiseRemoval
│   ├── ObjectRemoval
│   ├── RainRemoval
│   ├── RegionAccuracy
│   ├── Replacement
│   ├── ShadowRemoval
│   ├── SnowRemoval
│   ├── StyleAlteration
│   └── WatermarkRemoval
├── EditEval_ori             ### 1. with `original` editing instructions
                             ### 2. evaluation results of 8 editing models (any2pix,
                             ###    hive, hqedit, iedit, instruct-diffusion, instructpix2pix,
                             ###    magicbrush, mgie) in every dimension
│   ├── BGReplacement        # evaluation results of the 8 editing models
│   ├── ColorAlteration      # evaluation results of the 8 editing models
│   ├── Counting             # ...
│   ├── Deblurring
│   ├── DirectionPerception
│   ├── HazeRemoval
│   ├── Lowlight
│   ├── NoiseRemoval
│   ├── ObjectRemoval
│   ├── RainRemoval
│   ├── RegionAccuracy
│   ├── Replacement
│   ├── ShadowRemoval
│   ├── SnowRemoval
│   ├── StyleAlteration
│   └── WatermarkRemoval
├── EditRank                 ### 1. with `diverse` editing instructions
                             ### 2. ranking of the 8 editing models (any2pix,
                             ###    hive, hqedit, iedit, instruct-diffusion, instructpix2pix,
                             ###    magicbrush, mgie) in every dimension, based on the evaluation results
│   ├── BGReplacement.json   # rank results of the 8 editing models
│   ├── ColorAlteration.json # rank results of the 8 editing models
│   ├── Counting.json        # ...
│   ├── Deblurring.json
│   ├── DirectionPerception.json
│   ├── HazeRemoval.json
│   ├── Lowlight.json
│   ├── NoiseRemoval.json
│   ├── ObjectRemoval.json
│   ├── RainRemoval.json
│   ├── RegionAccuracy.json
│   ├── Replacement.json
│   ├── ShadowRemoval.json
│   ├── SnowRemoval.json
│   ├── StyleAlteration.json
│   └── WatermarkRemoval.json
├── EditRank_ori             ### 1. with `original` editing instructions
                             ### 2. ranking of the 8 editing models (any2pix,
                             ###    hive, hqedit, iedit, instruct-diffusion, instructpix2pix,
                             ###    magicbrush, mgie) in every dimension, based on the evaluation results
│   ├── BGReplacement.json   # rank results of the 8 editing models
│   ├── ColorAlteration.json # rank results of the 8 editing models
│   ├── Counting.json        # ...
│   ├── Deblurring.json
│   ├── DirectionPerception.json
│   ├── HazeRemoval.json
│   ├── Lowlight.json
│   ├── NoiseRemoval.json
│   ├── ObjectRemoval.json
│   ├── RainRemoval.json
│   ├── RegionAccuracy.json
│   ├── Replacement.json
│   ├── ShadowRemoval.json
│   ├── SnowRemoval.json
│   ├── StyleAlteration.json
│   └── WatermarkRemoval.json
├── EditResult               ### 1. with `diverse` editing instructions
                             ### 2. edited images produced by the 8 editing models (any2pix,
                             ###    hive, hqedit, iedit, instruct-diffusion, instructpix2pix,
                             ###    magicbrush, mgie) in every dimension
│   ├── BGReplacement
│   ├── ColorAlteration
│   ├── Counting
│   ├── Deblurring
│   ├── DirectionPerception
│   ├── HazeRemoval
│   ├── Lowlight
│   ├── NoiseRemoval
│   ├── ObjectRemoval
│   ├── RainRemoval
│   ├── RegionAccuracy
│   ├── Replacement
│   ├── ShadowRemoval
│   ├── SnowRemoval
│   ├── StyleAlteration
│   └── WatermarkRemoval
├── EditResult_ori           ### 1. with `original` editing instructions
                             ### 2. edited images produced by the 8 editing models (any2pix,
                             ###    hive, hqedit, iedit, instruct-diffusion, instructpix2pix,
                             ###    magicbrush, mgie) in every dimension
│   ├── BGReplacement
│   ├── ColorAlteration
│   ├── Counting
│   ├── Deblurring
│   ├── DirectionPerception
│   ├── HazeRemoval
│   ├── Lowlight
│   ├── NoiseRemoval
│   ├── ObjectRemoval
│   ├── RainRemoval
│   ├── RegionAccuracy
│   ├── Replacement
│   ├── ShadowRemoval
│   ├── SnowRemoval
│   ├── StyleAlteration
│   └── WatermarkRemoval
├── eval_scripts             ### scripts for evaluation
│   ├── high_level_eval_stage1.py    ## evaluation for high-level dimensions, e.g., BGReplacement;
                                      # stage 1: use an LVLM (e.g., GPT-4V) to ask questions (from the
                                      # JSONs in `EditData`) about the edited images and save the raw
                                      # `VLM_judgement` outputs in `EditEval` (see the sketch after the tree)
│   ├── high_level_eval_stage2_final_judge.py    # stage 2: use an LLM (e.g., GPT-4 Turbo) with a designed
                                      # prompt template to obtain a more stable `final_judgement`
│   ├── low_level_eval.py    ## evaluation for low-level dimensions, e.g., Deblurring
│   ├── metrics_utils        ## utilities for evaluation, e.g., GPT-4V, GPT-4 Turbo, CLIP, SSIM
│   ├── sample_rank_gen.py   ## generates `EditRank` and `EditRank_ori`
│   ├── summary.json         ## generated by `summary.py`
│   ├── summary_ori.json     ## generated by `summary.py`
│   ├── summary.py           ## generates `summary.json` and `summary_ori.json`,
                              # reporting metric scores for every model in every dimension
│   ├── summary_model_type_avg_score.json    ## generated by `summary_model_type_avg_score.py`
│   └── summary_model_type_avg_score.py      ## generates `summary_model_type_avg_score.json`,
                                              # reporting average metric scores for every editing model
                                              # in every dimension
└── readme.md
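The two-stage high-level evaluation can be pictured with the sketch below. The `ask_lvlm`/`ask_llm` helpers, the prompt template, and the JSON field names are illustrative assumptions, not the actual code in `high_level_eval_stage1.py` / `high_level_eval_stage2_final_judge.py`; check those scripts for the real interface.

```python
# Hypothetical sketch of the two-stage high-level evaluation.
import json
from pathlib import Path

def ask_lvlm(image_path: str, question: str) -> str:
    """Placeholder for an LVLM call (e.g., GPT-4V) that can see the image."""
    raise NotImplementedError  # wire up your vision-language client here

def ask_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call (e.g., GPT-4 Turbo)."""
    raise NotImplementedError  # wire up your language-model client here

def evaluate_dimension(questions_json: str, edited_dir: str, out_json: str) -> None:
    records = json.loads(Path(questions_json).read_text())
    results = []
    for rec in records:  # assumed fields: "image", "question"
        image = str(Path(edited_dir) / rec["image"])
        # Stage 1: raw, possibly verbose judgement from the LVLM.
        raw = ask_lvlm(image, rec["question"])
        # Stage 2: a fixed template forces the LLM to emit a clean verdict,
        # which is more stable than parsing the free-form stage-1 answer.
        verdict = ask_llm(
            "Given the answer below, reply with exactly 'yes' or 'no'.\n"
            f"Question: {rec['question']}\nAnswer: {raw}"
        )
        results.append({**rec, "VLM_judgement": raw,
                        "final_judgement": verdict.strip().lower()})
    Path(out_json).write_text(json.dumps(results, indent=2))
```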
Check how to evaluate with your own editing model.
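As a rough starting point, the sketch below shows one way to drop your own model's outputs into the `EditResult` layout above so the existing eval scripts can score them. `MyEditor`, `load_instruction_for`, and the `*.jpg` glob are hypothetical stand-ins; follow the docs for the actual file layout and instruction format.

```python
# Hypothetical sketch: run your own IIE model over the benchmark inputs and
# save its edits under EditResult/<Dimension>/<model_name>/, mirroring the
# eight reference models. MyEditor and load_instruction_for are stand-ins.
from pathlib import Path
from PIL import Image

DIMENSIONS = ["BGReplacement", "ColorAlteration", "Counting"]  # ... all 16
MODEL_NAME = "my_model"  # your model's folder name under each dimension

class MyEditor:
    def edit(self, image: Image.Image, instruction: str) -> Image.Image:
        raise NotImplementedError  # call your editing model here

def load_instruction_for(img_path: Path) -> str:
    raise NotImplementedError  # read the instruction from the EditData JSONs

def run_my_model(data_root: str = "EditData", result_root: str = "EditResult") -> None:
    editor = MyEditor()
    for dim in DIMENSIONS:
        out_dir = Path(result_root) / dim / MODEL_NAME
        out_dir.mkdir(parents=True, exist_ok=True)
        for img_path in sorted((Path(data_root) / dim).glob("*.jpg")):
            edited = editor.edit(Image.open(img_path), load_instruction_for(img_path))
            edited.save(out_dir / img_path.name)
```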
@inproceedings{ma2024i2ebench,
  title={I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing},
  author={Ma, Yiwei and Ji, Jiayi and Ye, Ke and Lin, Weihuang and Wang, Zhibin and Zheng, Yonghan and Zhou, Qiang and Sun, Xiaoshuai and Ji, Rongrong},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}