Clarification needed
devnkong opened this issue · 2 comments
Thanks for your great work! I want to know why we need the operation below. We only need half of the attention maps: for example, if we have 8 heads, then map_.size(0) below will be 16. But why do we have 16 in the first place, considering we only have 8 heads in each transformer block? Can you show me where diffusers does this? Really confused, thank you!
Line 215 in 119d8ff
Ah yes, the first half of the heads at that line operates on the unconditional latent embeddings (from classifier-free guidance), initialized here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L639. Since we care about the text-conditional embeddings only, we throw away those nuisance attention heads. You can verify that this procedure is sensible by visualizing the unconditional heads, e.g., map_ = map_[:map_.size(0) // 2].
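Here is a minimal sketch of the shape bookkeeping, in case it helps. The tensor names and sizes are illustrative, not the repo's actual variables; the only assumption is that the attention maps are stacked with the unconditional heads first, as described above.

```python
# Minimal sketch (not the repo's actual code) of why the attention-map
# dimension is 2 * num_heads under classifier-free guidance, and how the
# text-conditional half is kept.
import torch

num_heads = 8
query_len, key_len = 64 * 64, 77  # hypothetical spatial and token lengths

# Classifier-free guidance runs the UNet on a doubled batch:
# [unconditional latents; text-conditional latents], so each attention
# module yields 2 * num_heads maps.
uncond_maps = torch.rand(num_heads, query_len, key_len)  # heads 0..7
cond_maps = torch.rand(num_heads, query_len, key_len)    # heads 8..15
map_ = torch.cat([uncond_maps, cond_maps], dim=0)        # size(0) == 16

# Keep only the text-conditional half, as the operation at line 215 does.
cond_only = map_[map_.size(0) // 2:]
assert cond_only.size(0) == num_heads

# To inspect the discarded unconditional heads instead (as suggested above):
uncond_only = map_[:map_.size(0) // 2]
```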
Thank you so much!