When doing source image DDIM inversion, should the text prompt be empty?
g-jing opened this issue · 5 comments
In the paper, when doing DDIM inversion, the text prompt is a description of the real image. But in your code, the text prompt is an empty string. Could you confirm that? Thanks a lot. By the way, great work!
Hi @g-jing, thanks for your attention. As shown on page 5 (the footnote part) of our manuscript, when editing a real image we perform DDIM inversion with a null (i.e., empty) text prompt rather than a description of the image, similar to [1] and [2]. A minimal sketch of such an inversion loop is given below the references.
[1] Null-text inversion: https://arxiv.org/abs/2211.09794
[2] Plug-and-Play Diffusion Features: https://arxiv.org/abs/2211.12572
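For concreteness, here is a minimal sketch of deterministic DDIM inversion driven by the empty prompt. The helpers `unet`, `scheduler`, and `encode_prompt` stand in for the usual Stable Diffusion components; their names and signatures are assumptions for illustration, not the exact code in this repository:

```python
import torch

@torch.no_grad()
def ddim_invert(latent, unet, scheduler, encode_prompt, num_steps=50):
    """Map a clean latent x_0 back to a noisy latent x_T (eta = 0).

    `unet`, `scheduler` (DDIM), and `encode_prompt` are assumed to follow
    the usual Stable Diffusion conventions; this is a sketch, not the
    exact MasaCtrl implementation.
    """
    null_emb = encode_prompt("")  # empty prompt -> null text embedding
    scheduler.set_timesteps(num_steps)
    step = scheduler.config.num_train_timesteps // num_steps
    for t in reversed(scheduler.timesteps):  # walk x_0 -> x_T
        t = int(t)
        eps = unet(latent, t, encoder_hidden_states=null_emb).sample
        prev_t = max(t - step, 0)
        alpha_t = scheduler.alphas_cumprod[t]
        alpha_prev = scheduler.alphas_cumprod[prev_t]
        # Predicted x_0 from the current (less noisy) latent ...
        pred_x0 = (latent - (1 - alpha_prev).sqrt() * eps) / alpha_prev.sqrt()
        # ... then move it one step further along the noise schedule.
        latent = alpha_t.sqrt() * pred_x0 + (1 - alpha_t).sqrt() * eps
    return latent  # approximately x_T for the input image
```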
Your response is very detailed. I have some further questions:
- If you do not use a source prompt for source image reconstruction, how do you get $P_s$, $I_s$, and $M_s$ in Equation 6?
- In Equation 5, why do you choose to replace K and V but keep Q? The P2P paper found that the result of $QK^T$ can represent the object mask, so why don't you also replace Q and K, rather than keeping only Q from the target? Did you find that the $QK^T$ result cannot represent the object shape in the self-attention layers?
- For Equation 6, if I understand correctly, $M_s$ is applied to the result of $QK^T$, and the resulting attention map is then multiplied with $V$ to get the feature $f$. Please correct me if I am wrong. Also, is this mask step implemented in the code?
Thanks for your response!
Hi @g-jing,
- Actually, the cross-attention maps with null text tokens can still be used to extract the mask associated with the foreground object, so we can obtain $M_s$ for the source image. In Eq. 6, $I_s$ is the input real image and $P_s$ is the null text. Besides extraction from cross-attention maps, the mask can also be obtained with existing segmentation models.
- In Eq. 5, we use the query Q of the target image to query contents from the source image, since the query features of the source and target images are highly similar (shown in Fig. 4(b)). In P2P, the cross-attention map can represent the object shape, so layout-fixed editing can be performed by directly modifying the text prompt, yet it cannot perform content-consistent, non-rigid editing. In self-attention, we also find that the self-attention maps can maintain the image layout, which is similar to the observations in [1]. However, utilizing QK cannot keep the source contents unchanged! In other words, the synthesized image is content-inconsistent. I will add some cases later.
- Your understanding is correct. The mask restricts the attention so that each region can only query information from the corresponding region (source object <--> target object, source background <--> target background), thus the confusion problem is alleviated. A sketch covering Eq. 5 and this masked variant is given below the reference.
[1] Plug-and-Play Diffusion Features: https://arxiv.org/abs/2211.12572
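To make the Eq. 5 / Eq. 6 discussion concrete, here is a minimal single-head sketch of mutual self-attention with an optional mask restriction. The flattened `(N, d)` layout and the function names are assumptions for illustration; the actual implementation lives in masactrl/masactrl.py:

```python
import torch

def mutual_self_attention(q_tgt, k_src, v_src, mask_src=None, mask_tgt=None):
    """Sketch of Eq. 5/6: target queries attend to source keys/values.

    q_tgt, k_src, v_src: (N, d) flattened single-head features, where
    N = H * W spatial positions (illustrative layout, not MasaCtrl's).
    mask_src, mask_tgt: optional (N,) binary foreground masks.
    """
    scale = q_tgt.shape[-1] ** -0.5
    sim = (q_tgt @ k_src.transpose(-1, -2)) * scale  # (N, N)

    if mask_src is None:
        # Eq. 5: plain mutual self-attention (Q from target, K/V from source).
        return sim.softmax(dim=-1) @ v_src

    # Eq. 6: restrict each query to the corresponding source region.
    fg = mask_src.bool()                           # source foreground keys
    sim_fg = sim.masked_fill(~fg, float("-inf"))   # attend to foreground only
    sim_bg = sim.masked_fill(fg, float("-inf"))    # attend to background only
    out_fg = sim_fg.softmax(dim=-1) @ v_src
    out_bg = sim_bg.softmax(dim=-1) @ v_src
    # Blend with the target mask: fg positions take out_fg, bg take out_bg.
    m = mask_tgt.float().unsqueeze(-1)             # (N, 1)
    return m * out_fg + (1 - m) * out_bg
```

Note that Q always comes from the target (editing) branch while K and V come from the source branch, which is exactly the "keep Q, replace K/V" choice discussed above.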
Hi @ljzycmd,
Thanks a lot! Besides replacing QK (mentioned above) and replacing KV, did you test other types of replacement, such as replacing V or QV? Also, has the mask step been implemented in the codebase yet?
Hi @g-jing, we also tried other types of replacement, and they produced unsatisfying results (I will add some cases here later). The mask extraction strategy from cross-attention maps is implemented in masactrl/masactrl.py, so you can refer to it for more details. Hope this helps. 😃
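For reference, extracting a foreground mask from aggregated cross-attention maps typically looks roughly like the sketch below. The shapes, the token selection, and the fixed threshold are all assumptions for illustration, so please consult masactrl/masactrl.py for the actual strategy:

```python
import torch

def mask_from_cross_attention(attn_maps, token_idx, threshold=0.5):
    """Rough sketch: threshold aggregated cross-attention into a mask.

    attn_maps: list of (heads, H*W, num_tokens) cross-attention maps
    collected from low-resolution (e.g., 16x16) layers during sampling.
    token_idx: indices of the text tokens associated with the object.
    Illustrative only; not the exact MasaCtrl implementation.
    """
    # Average over heads, then over layers -> (H*W, num_tokens).
    agg = torch.stack([a.mean(dim=0) for a in attn_maps]).mean(dim=0)
    # Attention mass assigned to the selected token(s) per position.
    fg = agg[:, token_idx].mean(dim=-1)  # (H*W,)
    # Normalize to [0, 1] and binarize.
    fg = (fg - fg.min()) / (fg.max() - fg.min() + 1e-8)
    return (fg > threshold).float()  # flattened binary foreground mask
```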