
Text-driven inpainting and video generation using Stable Diffusion in PyTorch


Inpainting

Overview

(Demo video: Inpainting.mp4)

Text-driven inpainting

Inpainting

  1. The input image is enlarged to give the inpainting process more room. This step can be skipped if the input image is not a close-up view of the object.

  2. A mask is created for the input image. Since the input images have a white background, the object is covered with a black mask (the region to keep), while the white background is left as the region to be repainted.

  3. The input image, mask, and text prompt are passed to the inpainting model "stabilityai/stable-diffusion-2-inpainting" through Hugging Face diffusers, as sketched below.
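A minimal sketch of steps 2 and 3, assuming a white-background image saved as input.png; the file names, the near-white threshold of 240, and the fixed 512 × 512 size are illustrative, and the exact logic lives in src/main.py:

import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the input and bring it to the model's working resolution.
image = Image.open("input.png").convert("RGB").resize((512, 512))

# Build the mask: black over the object (kept), white over the white
# background (repainted by the model).
arr = np.array(image)
background = (arr > 240).all(axis=-1)  # near-white pixels
mask = Image.fromarray((background * 255).astype(np.uint8))

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

result = pipe(prompt="Bicycle on a street", image=image, mask_image=mask).images[0]
result.save("inpainted.png")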

Video Generation

  1. The inpainted images, originally 512 × 512, are expanded by adding a 50-pixel border (25 pixels on each side). The resulting 562 × 562 image is then resized back to 512 × 512.

  2. A mask is then created for the resized image so that only the newly added border is repainted.

  3. The masked image, along with a text prompt, undergoes an inpainting process.

  4. This entire procedure is repeated iteratively to generate a sequence of nine images, thereby creating a zooming-out effect (sketched below).
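A sketch of the loop, reusing pipe from the sketch above; the 25-pixel border matches the 512 → 562 expansion in step 1, while the white fill colour and the frame paths are assumptions, and the actual logic lives in src/video.py:

import os
from PIL import Image, ImageOps

os.makedirs("frames", exist_ok=True)
prompt = "Bicycle on a street"
frame = Image.open("inpainted.png").convert("RGB")
pad = 25  # pixels added on each side: 512 + 2 * 25 = 562

for i in range(8):  # eight further frames, nine images in total
    # Step 1: expand with a border, then resize back to 512 x 512.
    expanded = ImageOps.expand(frame, border=pad, fill="white").resize((512, 512))

    # Step 2: mask everything except the shrunken border, so only the
    # border region is repainted (white = inpaint, black = keep).
    border = round(512 * pad / (512 + 2 * pad))  # border width after resizing
    mask = Image.new("L", (512, 512), 255)
    mask.paste(0, (border, border, 512 - border, 512 - border))

    # Steps 3-4: outpaint the border and feed the result back in.
    frame = pipe(prompt=prompt, image=expanded, mask_image=mask).images[0]
    frame.save(f"frames/frame_{i + 1:02d}.png")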

High-quality GIFs for each example are available here

Each example shows the input image, the inpainted image, and the generated zoom-out video (output.mp4) for one of the following prompts:

  1. Product in a kitchen used in meal preparation
  2. Bicycle on a street
  3. Toaster placed on kitchen stand
  4. Chair behind a table in a study room
  5. Tent in a forest
  6. A bottle of whisky on stand of a bar

Failure cases

The model fails to inpaint images involving humans.

Both failure examples (input image and inpainted image) use the prompt "Person standing in a hall meeting people".

Installation

Conda

conda env create -f inpainting.yml

Alternative

conda create -n inpainting python=3.11

conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install diffusers

pip install transformers

pip install opencv-python

Inpainting

cd src
python main.py --prompt "<your text prompt>" --image_path "<path to your image>" --upscale True

By default, the upscale parameter is set to False. Set it to True for images in which the main object occupies a large portion of the frame: enlarging the canvas reduces the object's relative size, which improves the inpainting process.
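A hypothetical illustration of the upscale step: the object is pasted onto a larger white canvas so it occupies less of the frame. The 2x factor here is an assumption; see src/main.py for the actual behaviour.

from PIL import Image

img = Image.open("input.png").convert("RGB")
canvas = Image.new("RGB", (img.width * 2, img.height * 2), "white")
canvas.paste(img, (img.width // 2, img.height // 2))  # centre the original
canvas = canvas.resize((512, 512))  # back to the model's working size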

Video Generation

Use the inpainted image to generate a zooming-out video.

  1. video.py generates the frames for the video.
  2. render.py generates a GIF from the frames; adjust the duration parameter to change the smoothness of the video (see the sketch after the commands below).
cd src
python video.py --prompt "<your text prompt>" --image_path "<path to inpainted image>"

python render.py --path "<path to folder containing generated frames>"
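A minimal sketch of the rendering step, assuming the generated frames are PNG files in a frames/ folder; the 200 ms duration is only a starting point:

from pathlib import Path
from PIL import Image

frames = [Image.open(p) for p in sorted(Path("frames").glob("*.png"))]
frames[0].save(
    "output.gif",
    save_all=True,
    append_images=frames[1:],
    duration=200,  # ms per frame; lower values give a faster, smoother GIF
    loop=0,  # loop forever
)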

For the experiments, the prompts used for video generation were the same as those used during the inpainting process.


References

If you find this work useful, please cite:

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}