kohjingyu/fromage

[RET] Embedding

pUmpKin-Co opened this issue · 3 comments

Hi~ Thanks for your exciting work, it inspires me a lot.
I have a few questions after reading your code and paper:

  • Are there any quantitative results on the [RET] embedding (e.g., a table)?
  • I don't fully understand how updates to the [RET] embedding are kept separate from updates to the other token embeddings. I noticed that the gradients of the other embeddings are masked off, but requires_grad should be False for all of the LM's parameters at the start of training (since the LM is frozen/in eval mode). In that case, shouldn't none of the embeddings receive gradients?
  • What is the purpose of normalizing the embedding?

I hope you can answer these questions. Thank you!

My bad. Question 2 has already been answered in #6.
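
For reference, here is a minimal sketch of that kind of gradient masking: the embedding matrix is made trainable again and a hook zeros the gradient for every row except the [RET] token's. The sizes and `ret_token_idx` are placeholders, not the repo's actual values.

```python
import torch
import torch.nn as nn

# Minimal sketch: train only the [RET] row of the input embedding matrix.
vocab_size, hidden_dim = 32001, 4096    # placeholder sizes
ret_token_idx = vocab_size - 1          # hypothetical index of the added [RET] token

embed = nn.Embedding(vocab_size, hidden_dim)
embed.weight.requires_grad_(True)       # the embedding matrix must be trainable

def zero_other_rows(grad):
    # Keep the gradient for the [RET] row only; zero everything else.
    mask = torch.zeros_like(grad)
    mask[ret_token_idx] = 1.0
    return grad * mask

embed.weight.register_hook(zero_other_rows)
```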

Thanks for your kind words!

Are there any quantitative results on the [RET] embedding (e.g., a table)?

The RET embedding is used for retrieval, so I think the retrieval recall@k results in the paper are relevant: Table 1 for VIST, Table 2 for VisDial, and Table 3 in the appendix on MS-COCO.
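
For context, Recall@k there is computed in the standard way, roughly as in this sketch (an illustration, not the repo's evaluation code; it assumes one ground-truth image per query, stored at the same row index, and L2-normalized embeddings):

```python
import torch

def recall_at_k(query_emb: torch.Tensor, image_emb: torch.Tensor, k: int) -> float:
    sims = query_emb @ image_emb.t()                      # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                    # k most similar images per query
    targets = torch.arange(query_emb.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()           # did the true image land in the top-k?
    return hits.mean().item()
```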

What is the purpose of normalizing the embedding?

This is mostly to ensure that it has roughly the same magnitude as the other token embeddings, so it doesn't become too out-of-distribution (OOD). I didn't run ablations to check whether this is necessary, however; it may work similarly even if you don't normalize the [RET] embedding.
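
As a rough illustration (an assumption about the approach, not the repo's exact code), rescaling the [RET] row to the average norm of the other token embeddings would look something like:

```python
import torch
import torch.nn as nn

def rescale_ret_embedding(embed: nn.Embedding, ret_token_idx: int) -> None:
    # Match the [RET] row's norm to the average norm of all other rows.
    with torch.no_grad():
        w = embed.weight                                   # (vocab_size, hidden_dim)
        other = torch.cat([w[:ret_token_idx], w[ret_token_idx + 1:]])
        target_norm = other.norm(dim=1).mean()             # typical token embedding norm
        w[ret_token_idx] *= target_norm / w[ret_token_idx].norm()
```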

Hope that helps!

Thanks for your kind response!