omriav/blended-diffusion

Scribble-guided editing

wileewang opened this issue · 2 comments

Hi! I wonder if a loss such as MSE or LPIPS is used between the user-provided scribbles and the scribbled regions of $\widehat{x}_0$ , in addition to the CLIP loss. I am curious how the shapes and colors stay consistent when only text with no specific description, e.g., "blanket" in Fig 9, is given.

Hi,

Thank you for your interest in our work.
No, there is no need for an MSE/LPIPS loss; the only signal for the scribbles comes from partially noising the image (i.e., noising it only up to a certain intermediate noise level).
The shapes and the colors stay somewhat consistent because of the way the diffusion model operates: the initial denoising stages generate a rough sketch of the image, and the finer details are added later, so we can noise the image only up to the point that preserves the colors/shapes. For more details, please see Figure 32 in the paper.
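To illustrate the idea (this is a simplified sketch, not the repo's actual code), partial noising is just the standard DDPM forward process applied up to an intermediate timestep `t` instead of the final one. The beta schedule and timestep value below are illustrative assumptions:

```python
import numpy as np

def partial_noise(x0, t, alpha_bar, rng):
    """Noise a (scribbled) image to an intermediate diffusion step t.

    Standard DDPM forward process:
        x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps

    A smaller t keeps more of the scribble's coarse colors/shapes;
    a larger t lets the reverse process repaint more of the region.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Illustrative linear beta schedule with 1000 steps (a common choice).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(64, 64, 3))  # stand-in for a scribbled image
x_mid = partial_noise(x0, t=400, alpha_bar=alpha_bar, rng=rng)
```

Reverse denoising would then start from `x_mid` at step `t` rather than from pure noise, which is why the scribble's rough layout survives while the details are regenerated by the model.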

I see. Thanks for the explanation.