omriav/blended-diffusion

Question about training

JacksonCakes opened this issue · 4 comments

Hi, this is really impressive work! Two questions here.

  1. I would like to ask: does the overall text-guided image editing process use only a pre-trained model, without any extra training or fine-tuning?
  2. If it does not require any further fine-tuning or training, what is the purpose of the diffusion guided loss (which combines a loss from the CLIP model with a background preservation loss)?

Thanks in advance for your clarification!

Hi,

Thank you very much for the kind words!

Yes - there is no need for any further fine-tuning, we simply use the diffusion model as-is.
Essentially, the purpose of the diffusion model is to restrict the editing to the domain of natural images.
We want the edit operation to correspond to the guiding text (this is what CLIP is used for) and to look natural (this is why the diffusion model is used).
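To make this concrete, here is a minimal, self-contained sketch of the general idea of steering a pretrained diffusion model with a CLIP loss gradient at each denoising step. The dummy modules, the one-step x0 estimate, and the `guidance_scale` value are illustrative stand-ins, not the repo's actual models or code (the real implementation uses pretrained guided-diffusion and CLIP checkpoints):

```python
import torch

# Hypothetical stand-ins so the sketch runs; the real models are the
# pretrained guided-diffusion and CLIP networks, used as-is.
class DummyDiffusion(torch.nn.Module):
    def forward(self, x_t, t):
        # Predicts the noise component (epsilon) at timestep t.
        return torch.zeros_like(x_t)

class DummyCLIP(torch.nn.Module):
    def image_text_distance(self, image, text_embedding):
        # Toy scalar distance between image and text embeddings.
        return (image.mean() - text_embedding.mean()) ** 2

diffusion, clip = DummyDiffusion(), DummyCLIP()
text_embedding = torch.randn(512)

def guided_step(x_t, t, guidance_scale=100.0):
    """One reverse-diffusion step steered by the CLIP loss gradient."""
    x_t = x_t.detach().requires_grad_(True)
    eps = diffusion(x_t, t)
    # Rough one-step estimate of the clean image x0; the real code uses
    # the DDPM posterior coefficients here.
    x0_estimate = x_t - eps
    loss = clip.image_text_distance(x0_estimate, text_embedding)
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge the sample against the CLIP loss gradient: the pretrained
    # diffusion model itself is never updated, only the sample is.
    return (x_t - eps - guidance_scale * grad).detach()

x = torch.randn(1, 3, 64, 64)
for t in reversed(range(5)):
    x = guided_step(x, t)
```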

Hope this clarifies things.

Omri

Thanks for your quick response!

I think I should rephrase my question 2.
I can see that you use a CLIP loss at each reverse denoising step to better guide the generation toward a seamless output between the background and the edited region. So when does the diffusion guided loss, i.e. the combination of the CLIP loss and the background preservation loss in Algorithm 1, come into play?

Algorithm 1 is a weak baseline that we included in the paper; we showed that Algorithm 2 (a.k.a. Blended Diffusion) produces better results with no need for a background preservation loss.
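In case it helps future readers, here is a minimal sketch of the blending idea in Algorithm 2, under my reading of the paper: rather than penalizing background changes with a loss term, the background is enforced structurally by compositing at every step. `noise_to_level` and its toy schedule are hypothetical helpers, not the repo's API:

```python
import torch

def noise_to_level(x0, t, num_steps=1000):
    """Hypothetical forward-noising of the source image to step t."""
    alpha = 1.0 - t / num_steps  # toy schedule, for illustration only
    return alpha * x0 + (1.0 - alpha) * torch.randn_like(x0)

def blended_step(x_t, source_image, mask, t):
    """Keep the guided foreground, paste back the noised background."""
    background_t = noise_to_level(source_image, t)
    # mask == 1 inside the edited region, 0 elsewhere.
    return mask * x_t + (1.0 - mask) * background_t

x_t = torch.randn(1, 3, 64, 64)      # current guided sample
source = torch.randn(1, 3, 64, 64)   # input image whose background we keep
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0        # edit only the central square
x_t = blended_step(x_t, source, mask, t=500)
```

Because the unmasked region is overwritten with an appropriately noised copy of the input at every step, there is nothing left for a background preservation loss to do.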

Alright, thanks for your clarification!