huggingface/diffusion-models-class

Confusing inpaint

metamath1 opened this issue · 1 comment

My question is about the Inpaint section.

[Diagram from the tutorial: inpaint-unet]

In the tutorial's illustration, the inpainting UNet appears to take the text embedding, noisy latents, inpainting mask, and timestep as inputs. However, shouldn't init_image also be included as an input? It is passed to the pipeline in the tutorial's code:

prompt = "A small robot, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
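For reference, a minimal self-contained version of that call looks roughly like this (the checkpoint name and image paths here are placeholders of mine, not from the tutorial):

import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

# Load an inpainting checkpoint (checkpoint name is an assumption, not from the tutorial)
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# init_image is the original picture; mask_image is white where the model should repaint
init_image = load_image("init.png")  # placeholder path
mask_image = load_image("mask.png")  # placeholder path

prompt = "A small robot, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]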

Another question I have is about the training of the inpainting model. Are the masked image and the mask provided together as additional channels for conditioning?

Am I correct in understanding that training adds noise to the latents, and that the UNet input then includes four additional channels for the encoded masked image and one additional channel for the mask?
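To make my mental model concrete, here is a rough sketch of what I imagine the UNet input looks like (shapes and variable names are my guesses, not the actual training code):

import torch

batch, latent_ch, h, w = 2, 4, 64, 64

noisy_latents = torch.randn(batch, latent_ch, h, w)         # noised VAE latents of the target image
masked_image_latents = torch.randn(batch, latent_ch, h, w)  # VAE encoding of the masked init image
mask = torch.rand(batch, 1, h, w)                           # mask resized to latent resolution

# Concatenate along the channel dimension: 4 + 4 + 1 = 9 input channels
unet_input = torch.cat([noisy_latents, masked_image_latents, mask], dim=1)
print(unet_input.shape)  # torch.Size([2, 9, 64, 64])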

Thanks for bringing this up.

You are correct: inpainting models typically take the (encoded) masked image along with the mask as additional conditioning channels. I modified the diagram to show this more clearly: a951e50
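You can also verify this on a loaded pipeline: the inpainting UNet is configured with 9 input channels (4 noisy latents + 4 masked-image latents + 1 mask), versus 4 for the base text-to-image UNet. A quick check (checkpoint name assumed, any Stable Diffusion inpainting checkpoint should behave the same):

from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
print(pipe.unet.config.in_channels)  # 9 = 4 (noisy latents) + 4 (masked-image latents) + 1 (mask)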