yuval-alaluf/Attend-and-Excite

About your paper.

guyuchao opened this issue · 3 comments

You write in Section 4, "The pre-trained CLIP text encoder prepends a specialized token 〈sot〉 to P indicating the start of the text. During the text encoding process, this token receives global information about the prompt. This leads to 〈sot〉 obtaining a high probability in the token distribution defined in At."

However, due to the causal mask used in CLIP training, the 〈sot〉 token cannot receive any information from later tokens. Why, then, does this token receive global information about the prompt?
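To illustrate the point: under a causal (lower-triangular) mask, the token at position 0 can only attend to itself. A minimal sketch below — the sequence length and tensors are placeholders, not the actual CLIP implementation:

```python
import torch

# Minimal sketch of a causal (lower-triangular) attention mask like the one
# used in CLIP's text transformer. The sequence length is a placeholder.
seq_len = 6  # hypothetical: <sot>, four prompt tokens, <eot>
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Row 0 is the <sot> token: it may attend only to position 0, i.e. itself,
# so it cannot aggregate information from any later prompt tokens.
print(causal_mask[0])  # tensor([ True, False, False, False, False, False])
```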

Hi @guyuchao, thanks for your interest! I just saw what you meant in the code; this will be fixed in the next revision (however, it still does not contradict the fact that the 〈sot〉 token obtains a high attention value in SD).
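For anyone who wants to check this themselves, here is a rough sketch (not the repository's code) of how one could measure the attention mass that a cross-attention map assigns to the 〈sot〉 token at index 0. The `attn` tensor below is a random placeholder standing in for a map captured from the UNet's cross-attention layers via hooks:

```python
import torch

def token_attention_mass(attn: torch.Tensor) -> torch.Tensor:
    """Average an attention map of shape (heads, pixels, tokens) over heads
    and spatial positions, yielding one attention value per text token."""
    return attn.mean(dim=(0, 1))

# Placeholder standing in for a real cross-attention map (softmaxed over the
# token dimension) extracted from Stable Diffusion's UNet.
attn = torch.softmax(torch.randn(8, 16 * 16, 77), dim=-1)

mass = token_attention_mass(attn)
print("mass on <sot> (index 0):", mass[0].item())
print("total mass on remaining tokens:", mass[1:].sum().item())
```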

Great. It is indeed true that the 〈sot〉 token obtains a high attention value in SD.