EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

News

June 23, 2024

  • After consulting with the sponsors, we have released a training dataset that has not yet been manually rechecked. The dataset is available at EditWorld_data. Best of luck with your research!

Overview

This repository contains the official implementation of EditWorld. In this work, we introduce a new task, world-instructed image editing, which defines and categorizes editing instructions grounded in various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLaVA, and SDXL), and we propose a new post-edit method for world-instructed image editing.

World Instruction vs. Traditional Instruction

(figure: examples contrasting world instructions with traditional editing instructions)

Generated Results of Our EditWorld:

(figure: sample editing results generated by EditWorld)

Planning

  • [√] Provide the full text-to-image generation pipeline for the EditWorld dataset.
  • [√] Release the evaluation dataset.
  • [√] Release the basic training dataset.
  • Release checkpoints.
  • Release the training and post-edit code.

Codebase

Text-to-image generation branch

First, we employ GPT-3.5 to generate textual quadruples:

python gpt_script/text_img_gen_aigcbest_full.py --define_json gpt_script/define_sample_history/define_sample.json --output_path gpt_script/gen_sample_history/ --output_json text_gen.json
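
Under the hood, this step queries the GPT-3.5 chat API. Below is a minimal sketch of such a call, assuming the official openai Python client and illustrative quadruple field names; the actual prompt, model settings, and fields are defined in gpt_script/ and may differ:

import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative prompt; the real one is built from define_sample.json.
prompt = (
    "Generate an image-editing sample as JSON with the keys "
    '"input_text", "instruction", "output_text", and "keywords".'
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
quadruple = json.loads(response.choices[0].message.content)
print(quadruple)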

Then, we transform the text prompts provided by GPT into a dict:

python tools/deal_text2json.py --input_json gpt_script/gen_sample_history/text_gen.json --output_json text_gen_full.json
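
This transformation is simple bookkeeping. A minimal sketch of what such a script might do, with hypothetical field names (the real logic is in tools/deal_text2json.py):

import json

with open("gpt_script/gen_sample_history/text_gen.json") as f:
    raw_samples = json.load(f)

# Re-key the raw GPT outputs by sample index; field names are assumptions.
full = {
    f"sample{i}": {
        "input_text": s["input_text"],
        "instruction": s["instruction"],
        "output_text": s["output_text"],
    }
    for i, s in enumerate(raw_samples)
}

with open("text_gen_full.json", "w") as f:
    json.dump(full, f, indent=2)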

Finally, we obtain the input-instruction-output triples based on the generated textual quadruples:

python t2i_branch_base.py --text_json text_gen_full.json --save_path datasets/editworld/generated_img/

Note that t2i_branch_base.py is a fast, basic version of the text-to-image generation branch; we will improve this part in the future.
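
For intuition, here is a simplified sketch of the generation step using the diffusers SDXL pipeline, with illustrative prompts; the actual branch adds consistency tricks between the input and output images that we omit here:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# One textual quadruple (illustrative values) drives both generations.
sample = {
    "input_text": "a glass of water standing on a wooden table",
    "output_text": "a shattered glass and spilled water on a wooden table",
}
pipe(sample["input_text"]).images[0].save("sample0_ori.png")
pipe(sample["output_text"]).images[0].save("sample0_tar.png")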

Video branch

The video_script directory contains the code for downloading videos from InternVid.
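
InternVid references YouTube clips, so downloading amounts to fetching a video by ID and sampling frames. A minimal sketch assuming yt-dlp and OpenCV, with a placeholder video ID; the actual scripts in video_script may differ:

import cv2
from yt_dlp import YoutubeDL

video_id = "xxxxxxxxxxx"  # placeholder ID taken from an InternVid entry
with YoutubeDL({"format": "mp4", "outtmpl": "clip.mp4"}) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={video_id}"])

# Grab the first and last frames as an input/target image pair.
cap = cv2.VideoCapture("clip.mp4")
_, first = cap.read()
cv2.imwrite("sample0_ori.png", first)
cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
_, last = cap.read()
cv2.imwrite("sample0_tar.png", last)
cap.release()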

Dataset

Dataset structure

To obtain the training dataset file train.json, use the script at tools/obtain_datasetjson.py (a sketch of its logic follows the tree below). The dataset is organized in the following structure:

datasets/
├── editworld/
│   ├── generated_img/
│   │   ├── group_0/
│   │   │   ├── sample0_ori.png
│   │   │   ├── sample0_tar.png
│   │   │   ...
│   │   │   └── img_txt.json
│   │   └── group_1/
│   │   ...
│   ├── video_img/
│   │   ├── group_0/
│   │   │   ├── sample0_ori.png
│   │   │   ├── sample0_tar.png
│   │   │   ...
│   │   │   └── img_txt.json
│   │   └── group_1/
│   │   ...
│   └── human_select_img/
│       ├── group_0/
│       │   ├── sample0_ori.png
│       │   ├── sample0_tar.png
│       │   ...
│       │   └── img_txt.json
│       └── group_1/
│       ...
└── train.json
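
As referenced above, here is a minimal sketch of what tools/obtain_datasetjson.py presumably does: walk every group directory, read its img_txt.json, and collect instruction-paired image records into train.json. The field names inside img_txt.json are assumptions:

import json
from pathlib import Path

root = Path("datasets/editworld")
records = []
for branch in ("generated_img", "video_img", "human_select_img"):
    for group in sorted((root / branch).glob("group_*")):
        meta = json.loads((group / "img_txt.json").read_text())
        for name, info in meta.items():  # e.g. "sample0" -> {"instruction": ...}
            records.append({
                "input": str(group / f"{name}_ori.png"),
                "target": str(group / f"{name}_tar.png"),
                "instruction": info["instruction"],
            })

Path("datasets/train.json").write_text(json.dumps(records, indent=2))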

Evaluation dataset link

Our evaluation dataset is available at editworld_test.

Quantitative Comparison of CLIP Score and MLLM Score

IP2P: InstructPix2Pix; MB: MagicBrush; w/o post-edit: EditWorld without the post-edit method. The best result in each row is marked with an asterisk (*).

CLIP Score of Text-to-image Branch

Category          IP2P      MB        EditWorld   w/o post-edit
Long-Term         0.2140    0.1870    0.2244      0.2294*
Physical-Trans    0.2186    0.2101    0.2385      0.2467*
Implicit-Logic    0.2390    0.2432    0.2542*     0.2440
Story-Type        0.2063    0.2070    0.2534*     0.2354
Real-to-Virtual   0.2285    0.2344    0.2524*     0.2435

CLIP Score of Video Branch

Category          IP2P      MB        EditWorld   w/o post-edit
Spatial-Trans     0.2175    0.1997    0.2420*     0.2286
Physical-Trans    0.2315    0.2278    0.2467      0.2483*
Story-Type        0.2318    0.2262    0.2365      0.2399*
Exaggeration      0.2416    0.2328    0.2443*     0.2433

MLLM Score of Both Branches

Category          IP2P      MB        EditWorld   w/o post-edit
Text-to-image     0.8763    0.8455    0.8958      0.9060*
Video             0.9493    0.9715    0.9920*     0.9891
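
For reference, the CLIP scores above measure similarity between an edited image and its target text. A minimal sketch of one common way to compute such a score with Hugging Face transformers, assuming the openai/clip-vit-large-patch14 checkpoint; the paper's exact evaluation protocol may differ:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("edited.png")
text = "a shattered glass and spilled water on a wooden table"
inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# image_embeds and text_embeds are L2-normalized, so their dot product
# is the cosine similarity.
score = (out.image_embeds * out.text_embeds).sum().item()
print(f"CLIP score: {score:.4f}")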

Citation

@article{yang2024editworld,
  title={EditWorld: Simulating World Dynamics for Instruction-Following Image Editing},
  author={Yang, Ling and Zeng, Bohan and Liu, Jiaming and Li, Hong and Xu, Minghao and Zhang, Wentao and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2405.14785},
  year={2024}
}