About image prompt embeds
Immore commented
Thank you for your excellent work. I wonder whether the image prompt embedding of shape (bs, 1, 768) here can really carry enough information, or whether it exists mainly to keep SD1.5 running correctly. Performing cross-attention over a sequence of length 1 seems a bit strange to me.
Following other methods such as AnyDoor (although it uses DINOv2 rather than CLIP), should we instead take the full `last_hidden_state` (bs, 257, 1024) and map it through a learnable linear projection layer to a prompt of shape (bs, 257, 768)?
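
For illustration, here is a minimal PyTorch sketch of that idea. The class name `ImagePromptProjector` is hypothetical, and I use the CLIP ViT-L/14 vision encoder (257 tokens, hidden size 1024) rather than AnyDoor's DINOv2 encoder:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class ImagePromptProjector(nn.Module):
    """Learnable linear projection from the vision encoder width (1024)
    to the SD1.5 cross-attention width (768), applied per token so the
    full 257-token sequence is kept instead of a single pooled embed."""
    def __init__(self, vision_dim: int = 1024, cross_attn_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, cross_attn_dim)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # (bs, 257, 1024) -> (bs, 257, 768)
        return self.proj(last_hidden_state)

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
projector = ImagePromptProjector()

pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    hidden = vision_encoder(pixel_values).last_hidden_state  # (1, 257, 1024)
image_prompt = projector(hidden)  # (1, 257, 768), usable as UNet cross-attention context
```

This way the UNet could attend over all patch tokens instead of a single vector, at the cost of training the projection layer.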
Looking forward to your reply! Thank you.