About image prompt embeds
Immore commented
Thank you for your excellent work. I wonder whether the image prompt embedding of shape (bs, 1, 768) here can really carry enough information, or whether it exists mainly to keep SD1.5 running correctly. Performing cross-attention over a sequence of length 1 seems a bit strange to me.
Following other methods such as AnyDoor (although it uses DINOv2 rather than CLIP), should we instead take the full `last_hidden_state` (bs, 257, 1024) and map it through a learnable linear projection layer to a prompt of shape (bs, 257, 768)?
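
For illustration, here is a minimal PyTorch sketch of that idea. The class name `ImagePromptProjector` is hypothetical, and I use the CLIP ViT-L/14 vision encoder (257 tokens, hidden size 1024) rather than AnyDoor's DINOv2 encoder:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class ImagePromptProjector(nn.Module):
    """Learnable linear projection from the vision encoder width (1024)
    to the SD1.5 cross-attention width (768), applied per token so the
    full 257-token sequence is kept instead of a single pooled embed."""
    def __init__(self, vision_dim: int = 1024, cross_attn_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, cross_attn_dim)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # (bs, 257, 1024) -> (bs, 257, 768)
        return self.proj(last_hidden_state)

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
projector = ImagePromptProjector()

pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    hidden = vision_encoder(pixel_values).last_hidden_state  # (1, 257, 1024)
image_prompt = projector(hidden)  # (1, 257, 768), usable as UNet cross-attention context
```

This way the UNet could attend over all patch tokens instead of a single vector, at the cost of training the projection layer.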
Looking forward to your reply! Thank you.