Paper | Project Page | YouTube Video
Official implementation of "Text-Driven Image Editing via Learnable Regions"
Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, Ming-Hsuan Yang
Abstract: Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pre-trained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences, or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: https://yuanze-lin.me/LearnableRegions_page.
- [2024.8.16] Released a demo on Colab. Have fun playing with it 🎨.
- [2024.8.15] Code has been released.
To set up the environment, run the following commands in a shell:
git clone https://github.com/yuanze-lin/Learnable_Regions.git
cd Learnable_Regions
conda create -n LearnableRegion python==3.9 -y
source activate LearnableRegion
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
conda env update --file enviroment.yaml
This creates the LearnableRegion environment that we used.
Run the following command to edit a single image.
Since runwayml has removed its inpainting model ('runwayml/stable-diffusion-inpainting'), if you do not have a stored copy of it, set --diffusion_model_path 'stabilityai/stable-diffusion-2-inpainting'.
torchrun --nnodes=1 --nproc_per_node=1 train.py \
--image_file_path images/1.png \
--image_caption 'trees' \
--editing_prompt 'a big tree with many flowers in the center' \
--diffusion_model_path 'stabilityai/stable-diffusion-2-inpainting' \
--output_dir output/ \
--draw_box \
--lr 5e-3 \
--max_window_size 15 \
--per_image_iteration 10 \
--epochs 1 \
--num_workers 8 \
--seed 42 \
--pin_mem \
--point_number 9 \
--batch_size 1 \
--save_path checkpoints/
The editing results are stored in $output_dir. Editing a single image takes about 4 minutes on one RTX 8000 GPU.
You can tune max_window_size, per_image_iteration and point_number to adjust the editing time and performance.
The hyper-parameters introduced by our method are:
"image_caption": the caption of the input image; we simply use the class name in our paper.
"editing_prompt": the editing prompt for manipulating the input image.
"max_window_size": the maximum anchor bounding box size.
"per_image_iteration": the number of training iterations for each image.
"point_number": the number of sampled anchor points.
"draw_box": whether to draw the bounding boxes on the results for visualization; the visualizations are saved to $output_dir/boxes.
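When scripting runs over custom images, the single-image command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_edit_command is hypothetical, and it only uses flags and default values that appear in the command above.

```python
import shlex


def build_edit_command(image_file_path, image_caption, editing_prompt,
                       max_window_size=15, per_image_iteration=10,
                       point_number=9, draw_box=True,
                       diffusion_model_path="stabilityai/stable-diffusion-2-inpainting",
                       output_dir="output/"):
    """Hypothetical helper: return the single-image torchrun call as an argument list."""
    cmd = [
        "torchrun", "--nnodes=1", "--nproc_per_node=1", "train.py",
        "--image_file_path", image_file_path,
        "--image_caption", image_caption,
        "--editing_prompt", editing_prompt,
        "--diffusion_model_path", diffusion_model_path,
        "--output_dir", output_dir,
        # Hyper-parameters that adjust editing time and performance.
        "--max_window_size", str(max_window_size),
        "--per_image_iteration", str(per_image_iteration),
        "--point_number", str(point_number),
        # Remaining values are copied unchanged from the README command.
        "--lr", "5e-3", "--epochs", "1", "--num_workers", "8",
        "--seed", "42", "--pin_mem", "--batch_size", "1",
        "--save_path", "checkpoints/",
    ]
    if draw_box:
        cmd.append("--draw_box")
    return cmd


cmd = build_edit_command("images/1.png", "trees",
                         "a big tree with many flowers in the center")
# Print a copy-pastable shell command; to launch it directly,
# pass the list to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Returning a list rather than a single string means prompts containing spaces or quotes need no manual escaping when the list is passed to subprocess.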
Run the following command to edit multiple images simultaneously.
If you did not download the inpainting model 'runwayml/stable-diffusion-inpainting' before it was removed, set --diffusion_model_path 'stabilityai/stable-diffusion-2-inpainting'.
torchrun --nnodes=1 --nproc_per_node=2 train.py \
--image_dir_path images/ \
--output_dir output/ \
--json_file images.json \
--diffusion_model_path 'stabilityai/stable-diffusion-2-inpainting' \
--draw_box \
--lr 5e-3 \
--max_window_size 15 \
--per_image_iteration 10 \
--epochs 1 \
--num_workers 8 \
--seed 42 \
--pin_mem \
--point_number 9 \
--batch_size 1 \
--save_path checkpoints/
Edit a single custom image: use the command from Edit Single Image and change image_file_path, image_caption and editing_prompt accordingly.
Edit multiple custom images: prepare a JSON file that follows the structure of images.json. Each key is an input image's name, and its values are the class/caption of the input image and the editing prompt, respectively. Then run the command from Edit Multiple Images.
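A JSON file with this structure can also be generated from a list of images. The sketch below is an illustration under assumptions: the helper names are hypothetical, and storing the two values as a two-element list [caption, editing prompt] is inferred from the description above, so compare the result with the images.json shipped in the repository before using it.

```python
import json


def build_edit_spec(entries):
    """Map each image name to [caption, editing_prompt] (layout is an assumption)."""
    spec = {}
    for image_name, caption, editing_prompt in entries:
        if image_name in spec:
            # JSON keys must be unique, so reject repeated image names early.
            raise ValueError(f"duplicate image name: {image_name}")
        spec[image_name] = [caption, editing_prompt]
    return spec


def save_edit_spec(spec, json_path):
    """Write the spec to disk so it can be passed via --json_file."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(spec, f, indent=2, ensure_ascii=False)


spec = build_edit_spec([
    ("1.png", "trees", "a big tree with many flowers in the center"),
    # Hypothetical second entry, for illustration only.
    ("my_dog.png", "dog", "a dog wearing a red scarf"),
])
print(json.dumps(spec, indent=2))
```

Calling save_edit_spec(spec, "my_images.json") writes the file; the file name here is only an example, and the path is then given to the multi-image command through --json_file.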
If you find our work useful in your research or applications, please consider citing our paper using the following BibTeX:
@inproceedings{lin2024text,
title={Text-driven image editing via learnable regions},
author={Lin, Yuanze and Chen, Yi-Wen and Tsai, Yi-Hsuan and Jiang, Lu and Yang, Ming-Hsuan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={7059--7068},
year={2024}
}