omriav/blended-diffusion

How the CLIP model works

davvvy opened this issue · 4 comments

davvvy commented

Hi!

First of all, thanks for your great work.

I'm wondering how the CLIP model works in your code. From what I can tell, especially in image_editor.py, the CLIP model takes the -prompt as input, and it is only used to compute a distance (loss). However, I expected the CLIP model to take the -prompt as input and generate the corresponding image, which would then be placed in the white part of the mask to produce the final image. I didn't see anything about that; maybe I missed it. Could you please explain?

Thanks!

omriav commented

Hi,
Hi,
The CLIP model takes the text prompt as input and produces a text embedding. This embedding is later used by the diffusion model to generate the image.
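
For reference, a minimal sketch of how a text embedding can be obtained with OpenAI's CLIP package (an illustration only, not the exact code in this repo):

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a CLIP model; the checkpoint name here is just an example.
model, preprocess = clip.load("ViT-B/32", device=device)

# Tokenize the text prompt and encode it into a text embedding.
tokens = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # shape [1, 512] for ViT-B/32
```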

davvvy commented

Hi!

What I understand is:
The text embedding is used in the loss (the final distance at line 282 of image_editor.py) between the text content and the part generated by the diffusion model in the corresponding mask region. The text embedding is not decoded into an image that takes part in the diffusion process.
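
To check my reading, a simplified sketch of such a guidance loss (my own cosine-distance approximation, not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def clip_guidance_loss(clip_model, image, mask, text_embedding):
    """Cosine distance between the CLIP embedding of the masked image
    region and the text embedding. Simplified; illustration only."""
    # Keep only the edited (white) region of the mask; zero out the rest.
    masked_image = image * mask

    # CLIP's visual encoder expects 224x224 input; proper CLIP
    # normalization (mean/std) is omitted here for brevity.
    masked_image = F.interpolate(masked_image, size=224, mode="bilinear")

    image_embedding = clip_model.encode_image(masked_image)

    # Normalize both embeddings and use 1 - cosine similarity as the loss.
    image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    return (1 - (image_embedding * text_embedding).sum(dim=-1)).mean()
```

So, as I understand it, the gradient of this loss with respect to the image is what steers the sampling toward the prompt; no image is ever decoded from the text embedding itself.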

Is it right?

omriav commented

Right.

davvvy commented

Got it. Thanks!