VINHYU/CoSeR

Questions Regarding Your Paper

iGuoYanjun opened this issue · 2 comments

I recently read your paper and found it extremely insightful. I have a few questions about specific parts of it, and I would greatly appreciate your clarification:

(1) In Figure 3 of the paper, in the left image of the first row, what does the "Token" number refer to? Does "All token" mean the 77 tokens from the CLIP text embedding?

(2) Could you please explain the meaning of 'supervised L' as mentioned in your paper? In the formula, what does 'Padding' refer to? Does the formula imply that when there are more class tokens, they are prioritized for supervision, and if insufficient, the original L (excluding class tokens) is used to supplement?

(3) In the supplementary materials, what is the significance of Experiment Figure B.6?

(4) Regarding the statement "It is noted that the setting of Te = 77 in L′ differs from using L for supervision. This distinction arises from the fact that the final tokens of L′ are expanded with class tokens when the caption is not sufficiently lengthy," could you please elaborate on what 'expand' means in this context?

VINHYU commented

Hello, thank you for your interest in our work. We also need to thank you for your help in identifying an inaccuracy in the paper! Formulas 2 and 4 in the paper should be expressed in the following form, which we think may solve most of your doubts:
[Image: screenshot of the corrected Formulas 2 and 4]

(1) It refers to the number of tokens used, counting up to and including the cls token.
"All token" means all tokens up to and including the cls token, not all 77 tokens.
(2) L denotes the CLIP language embedding extracted from the ground-truth caption. L' is the supervision of the cognitive encoder, which is defined by Formula (2).
'Padding' means that when the number of tokens in L up to and including the cls token (tcls) is less than Te, we pad the end with the cls token until the length reaches Te.
As for your reading of the formula, that's not correct; the cls token is always included in the supervision. If tcls < Te, we pad the end with the cls token; if tcls > Te, we take the Te tokens before the cls token in L (including the cls token itself) as supervision.
(3) We want to use Figure B.6 to illustrate that employing L directly as supervision (instead of L') hinders the acquisition of cognitive information.
(4) The modified formula will help you understand the sentence. It should be noted that tcls varies with caption length. When Te = 77, we replace all tokens after the cls token in L with the cls token.
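
For readers without access to the corrected formula image, the rule described in (2) and (4) can be sketched roughly as follows. This is only an illustration of one reading of the reply, not the authors' implementation: the function name `build_supervision`, the list-of-vectors representation, and the convention that `t_cls` counts tokens up to and including the cls token are all assumptions made here. In particular, the truncation case is read as "the window of Te tokens that ends at the cls token".

```python
from typing import List, Sequence

Token = Sequence[float]  # one token embedding (toy stand-in for a CLIP vector)


def build_supervision(L: Sequence[Token], t_cls: int, T_e: int) -> List[Token]:
    """Build L' from L (hypothetical helper, assumed reading of the reply).

    t_cls: number of tokens in L up to and including the cls token.
    T_e:   desired number of supervision tokens.
    """
    if not 1 <= t_cls <= len(L):
        raise ValueError("t_cls must lie between 1 and len(L)")
    if T_e < 1:
        raise ValueError("T_e must be positive")

    cls_token = L[t_cls - 1]
    if t_cls >= T_e:
        # Enough tokens: keep the T_e tokens ending at (and including) cls.
        return list(L[t_cls - T_e:t_cls])
    # Too few tokens: keep everything up to cls, then repeat cls to length T_e.
    return list(L[:t_cls]) + [cls_token] * (T_e - t_cls)


# Toy data: 77 one-dimensional "embeddings", token i has value i.
L = [[float(i)] for i in range(77)]

padded = build_supervision(L, t_cls=5, T_e=8)      # short caption, padded with cls
truncated = build_supervision(L, t_cls=10, T_e=4)  # long caption, truncated
full = build_supervision(L, t_cls=5, T_e=77)       # T_e = 77 case from (4)
```

With `T_e = 77`, every position after the cls token holds a copy of the cls token, which is how L' differs from using L directly. When `t_cls` equals `T_e`, both branches of the rule give the same result, so the sketch does not need to pick one.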

Thank you for addressing my inquiries about your paper. Your detailed responses have greatly enhanced my understanding and appreciation of your work. Hope your paper is accepted soon!