Clarification needed
devnkong opened this issue · 2 comments
Thanks for your great work! I want to know why we need the operation below. We only need half of the attention maps: for example, if we have 8 heads, then map_.size(0) below will be 16. But why do we have 16 in the first place, considering we only have 8 heads in each transformer block? Can you show me where diffusers does this? Really confused, thank you!
Line 215 in 119d8ff
Ah yes, the first half of the heads at that line operates on the unconditional latent embeddings (from classifier-free guidance), initialized here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L639. Since we care about the text-conditional embeddings only, we throw away those nuisance attention heads. You can verify that this procedure is sensible by visualizing the unconditional heads, e.g., map_ = map_[:map_.size(0) // 2].
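Here is a minimal sketch of the shape bookkeeping, in case it helps. The tensor names and sizes are illustrative, not the repo's actual variables; the only assumption is that the attention maps are stacked with the unconditional heads first, as described above.

```python
# Minimal sketch (not the repo's actual code) of why the attention-map
# dimension is 2 * num_heads under classifier-free guidance, and how the
# text-conditional half is kept.
import torch

num_heads = 8
query_len, key_len = 64 * 64, 77  # hypothetical spatial and token lengths

# Classifier-free guidance runs the UNet on a doubled batch:
# [unconditional latents; text-conditional latents], so each attention
# module yields 2 * num_heads maps.
uncond_maps = torch.rand(num_heads, query_len, key_len)  # heads 0..7
cond_maps = torch.rand(num_heads, query_len, key_len)    # heads 8..15
map_ = torch.cat([uncond_maps, cond_maps], dim=0)        # size(0) == 16

# Keep only the text-conditional half, as the operation at line 215 does.
cond_only = map_[map_.size(0) // 2:]
assert cond_only.size(0) == num_heads

# To inspect the discarded unconditional heads instead (as suggested above):
uncond_only = map_[:map_.size(0) // 2]
```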
Thank you so much!