yuval-alaluf/Attend-and-Excite

About your paper.

guyuchao opened this issue · 3 comments

You write in Section 4, "The pre-trained CLIP text encoder prepends a specialized token 〈sot〉 to P indicating the start of the text. During the text encoding process, this token receives global information about the prompt. This leads to 〈sot〉 obtaining a high probability in the token distribution defined in At."

However, due to the causal mask used in CLIP training, the 〈sot〉 token cannot receive any information from later tokens. Why, then, does this token receive global information about the prompt?
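To illustrate the point: under a causal (lower-triangular) mask, the token at position 0 can only attend to itself. A minimal sketch below — the sequence length and tensors are placeholders, not the actual CLIP implementation:

```python
import torch

# Minimal sketch of a causal (lower-triangular) attention mask like the one
# used in CLIP's text transformer. The sequence length is a placeholder.
seq_len = 6  # hypothetical: <sot>, four prompt tokens, <eot>
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Row 0 is the <sot> token: it may attend only to position 0, i.e. itself,
# so it cannot aggregate information from any later prompt tokens.
print(causal_mask[0])  # tensor([ True, False, False, False, False, False])
```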

Hi @guyuchao, thanks for your interest! I just saw what you meant in the code; this will be fixed in the next revision (however, it still does not contradict the fact that the 〈sot〉 token obtains a high attention value in SD).
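For anyone who wants to check this themselves, here is a rough sketch (not the repository's code) of how one could measure the attention mass that a cross-attention map assigns to the 〈sot〉 token at index 0. The `attn` tensor below is a random placeholder standing in for a map captured from the UNet's cross-attention layers via hooks:

```python
import torch

def token_attention_mass(attn: torch.Tensor) -> torch.Tensor:
    """Average an attention map of shape (heads, pixels, tokens) over heads
    and spatial positions, yielding one attention value per text token."""
    return attn.mean(dim=(0, 1))

# Placeholder standing in for a real cross-attention map (softmaxed over the
# token dimension) extracted from Stable Diffusion's UNet.
attn = torch.softmax(torch.randn(8, 16 * 16, 77), dim=-1)

mass = token_attention_mass(attn)
print("mass on <sot> (index 0):", mass[0].item())
print("total mass on remaining tokens:", mass[1:].sum().item())
```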

Great. It is indeed true that the 〈sot〉 token obtains a high attention value in SD.