TencentARC/MasaCtrl

Question about Attention

cuishuhao opened this issue · 12 comments

Thank you for sharing this work!
I wonder how the code achieves the transfer from the source attention to the target. Is it achieved by https://github.com/TencentARC/MasaCtrl/blob/main/masactrl/masactrl.py#L35C25-L43?

Hi @cuishuhao, these lines perform the attention process. Specifically, as shown in https://github.com/TencentARC/MasaCtrl/blob/f4476b0adeb6d111a532aca5111457fc5b6e9f88/masactrl/masactrl.py#LL58C1-L59C138, the target image features serve as Q, while K and V are obtained from the source image features, so the target queries content from the source image.
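To make that concrete, here is a minimal, illustrative sketch of the idea (the function name and shapes are assumptions, not the repo's exact code): the target branch keeps its own Q, while K and V come from the source branch.

```python
import torch

def mutual_self_attention(q_target, k_source, v_source):
    # q_target:  (num_heads, N_t, d)  queries from the target image features
    # k_source:  (num_heads, N_s, d)  keys from the source image features
    # v_source:  (num_heads, N_s, d)  values from the source image features
    scale = q_target.shape[-1] ** -0.5
    attn = torch.softmax(q_target @ k_source.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_source  # target pixels query content from the source image
```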

Should K and V be ku and vu in the second line instead of kc and vc, then? Since we want to use the K and V from the source.

Hi @LWprogramming, u and c stand for unconditional and conditional, respectively. Both the conditional and unconditional parts contribute to the synthesis result under classifier-free guidance, so we query the source image in both parts. Note that ku and vu contain both source and target features; ku[:num_heads] and vu[:num_heads] are the features from the source image.
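To picture the layout, here is an illustrative sketch with assumed shapes (the slicing convention follows the description above, but the code itself is not copied from the repo):

```python
# With classifier-free guidance, the UNet batch holds 4 images:
#   [source_uncond, target_uncond, source_cond, target_cond]
# After the attention layer reshapes to (batch * num_heads, N, head_dim),
# chunking along dim 0 separates the unconditional and conditional halves,
# and slicing by num_heads separates source from target inside each half.
import torch

num_heads, N, head_dim = 8, 64 * 64, 40
q = torch.randn(4 * num_heads, N, head_dim)   # full CFG batch

qu, qc = q.chunk(2)              # unconditional half, conditional half
q_src_u = qu[:num_heads]         # source-image queries (unconditional)
q_tgt_u = qu[num_heads:]         # target-image queries (unconditional)
# the same slicing applies to ku, vu and to the conditional part qc, kc, vc
```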

Hope this can help you. 😃

Oh, you're right, I misread that :) I hadn't properly understood the connection between __call__ and forward methods before. Looking at it more closely, it looks like in the notebook we call regiter_attention_editor_diffusers, which calls editor.__call__. But then it seems like the relevant logic for saving K and V from source is in AttentionStore, while MutualSelfAttentionControl inherits from AttentionBase instead of AttentionStore. How does it eventually connect?

Hi @LWprogramming, register_attention_editor_diffusers replaces the original attention forward pass with our modified one. Note that qu, ku, vu contain the queries, keys, and values of both the source and target images. Taking qu as an example, qu[:num_heads] is the query of the source image and qu[num_heads:] is the query of the target image. The same applies to ku, vu and to the conditional part qc, kc, vc.
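Roughly, the registration works by monkey-patching the attention modules. The sketch below is a simplified, hypothetical version of that idea (the register_attention_editor name, the hasattr check, and the editor call signature are assumptions, not MasaCtrl's actual code); it only shows how an editor object could be routed into the attention forward passes of a diffusers UNet.

```python
import torch.nn as nn

def register_attention_editor(unet: nn.Module, editor) -> None:
    def make_forward(attn_module):
        def new_forward(hidden_states, encoder_hidden_states=None, **kwargs):
            q = attn_module.to_q(hidden_states)
            context = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
            k = attn_module.to_k(context)
            v = attn_module.to_v(context)
            # the editor decides how Q/K/V interact (e.g., mutual self-attention)
            # and is expected to return head-merged features
            out = editor(q, k, v, attn_module.heads)
            return attn_module.to_out[0](out)
        return new_forward

    for name, module in unet.named_modules():
        # diffusers attention blocks expose to_q/to_k/to_v; a real implementation
        # would additionally restrict this to the self-attention layers only
        if hasattr(module, "to_q") and hasattr(module, "to_k") and hasattr(module, "to_v"):
            module.forward = make_forward(module)
```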

Ah, I see now: in the originally linked code I'd overlooked that q, k, and v for u and c all have 2 * num_heads instead of num_heads in one dimension. Thanks!

In this link, I do not understand why the entire qu is passed. What is the intuitive explanation for passing the entire qu, i.e., using both the source and target images?
My understanding is that both the source image and the target image need to go through attention, so qu is not qu[num_heads:]. In the attention block, the two images do not interfere with each other, and in the end we only output the target image. In that case, could we use just qu[num_heads:]?

Hi @kingnobro, I am sorry for the confusion. Note that qu[:num_heads] is the query feature from the source image and qu[num_heads:] is the query of the target image, while only the source key and value features serve as K and V in the attention process. This way the source image can be reconstructed (or synthesized), and the target image can query image content from the source image. Since the two denoising processes are performed simultaneously in the current implementation, we cannot use only qu[num_heads:] to generate the target image.
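To illustrate why passing the whole qu is harmless for the target branch, here is a small self-contained check (names and shapes are made up): because each query row attends to the same source K/V independently, batching the source and target queries gives the same outputs as running the two branches separately, so the single call simply performs the source reconstruction and the target query in one pass.

```python
import torch

def attend(q, k, v):
    scale = q.shape[-1] ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

num_heads, N, d = 8, 16, 40
q_src, q_tgt = torch.randn(num_heads, N, d), torch.randn(num_heads, N, d)
k_src, v_src = torch.randn(num_heads, N, d), torch.randn(num_heads, N, d)

# both query sets against the same source K/V, in one batched call
joint = attend(torch.cat([q_src, q_tgt]),
               k_src.repeat(2, 1, 1), v_src.repeat(2, 1, 1))
# the same two branches computed separately
separate = torch.cat([attend(q_src, k_src, v_src), attend(q_tgt, k_src, v_src)])

print(torch.allclose(joint, separate, atol=1e-6))  # True
```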

Maybe it works fine without chunking into u and c, too? I checked, and it turns out to give the same values as the current algorithm.

Hi @FerryHuang, I'd like to further validate the results without chunking the unconditional and conditional parts during the denoising process, and I will update the results here. In our previous experiments, performing the mutual self-attention on the two parts independently achieved better results than doing it jointly.
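To make the "chunking" concrete, here is an assumed-shape sketch of applying the edit to the unconditional and conditional halves independently, so the two halves never share keys or values; the non-chunked variant would instead run one edit over the whole four-image batch. This is only an illustration, not the repo's code.

```python
import torch

def mutual_attn(q, k, v, num_heads):
    # within one CFG half: every query attends to the source image's K/V
    k_src, v_src = k[:num_heads], v[:num_heads]
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k_src.repeat(2, 1, 1).transpose(-2, -1) * scale, dim=-1)
    return attn @ v_src.repeat(2, 1, 1)

def forward_chunked(q, k, v, num_heads):
    # split the CFG batch into its unconditional and conditional halves
    # and apply the mutual self-attention edit to each half on its own
    qu, qc = q.chunk(2); ku, kc = k.chunk(2); vu, vc = v.chunk(2)
    return torch.cat([mutual_attn(qu, ku, vu, num_heads),
                      mutual_attn(qc, kc, vc, num_heads)])

num_heads, N, d = 8, 16, 40
q = torch.randn(4 * num_heads, N, d); k = torch.randn_like(q); v = torch.randn_like(q)
out = forward_chunked(q, k, v, num_heads)   # (4 * num_heads, N, d)
```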

Sorry, I don't really understand why the two denoising processes are performed simultaneously. In the implementation,

noise_pred = self.unet(model_inputs, t, encoder_hidden_states=text_embeddings).sample

it seems that only one denoising process is performed.

Hi @TimelessXZY, model_inputs consists of the noisy latents for both the source branch and the target branch. The real editing is performed inside each hacked attention class (e.g., MutualSelfAttentionControl).
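For readers following the thread, here is a hedged sketch of that step against the diffusers API (the function name, argument names, and exact batching order are assumptions, not copied from the repo): both branches are denoised by a single UNet call, and the registered attention editor does the editing inside that forward pass.

```python
import torch

def denoise_step(unet, scheduler, source_latents, target_latents, t,
                 uncond_emb, cond_emb, guidance_scale=7.5):
    """One denoising step over BOTH branches with a single UNet call.

    The registered attention editor sees the source and target features side
    by side inside this forward pass and rewires K/V there; no separate UNet
    call per branch is needed.
    """
    latents = torch.cat([source_latents, target_latents])   # (2, C, H, W)
    model_inputs = torch.cat([latents] * 2)                  # uncond + cond copies for CFG
    # uncond_emb and cond_emb each cover both branches, e.g. shape (2, 77, 768)
    text_embeddings = torch.cat([uncond_emb, cond_emb])

    noise_pred = unet(model_inputs, t, encoder_hidden_states=text_embeddings).sample
    noise_uncond, noise_cond = noise_pred.chunk(2)
    noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # updated latents for [source, target]
    return scheduler.step(noise, t, latents).prev_sample
```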