kohjingyu/fromage

retrieval only mode


Hi,
Thanks for sharing your great paper and code!
I am wondering about a retrieval-only use case (without dialogue or question answering).
Does training the image-captioning objective benefit the retrieval model, for example when images are used as context? If so, why is it better than using the retrieval model's visual embeddings for the context images?
Also, as part of the retrieval model, you have a cross-entropy loss against the input caption. Is this loss beneficial in retrieval-only mode?
Thanks,
Ofer

The two objectives are mostly independent; you can refer to Table 3 in the appendix of the paper for an ablation. We find that the captioning loss doesn't affect retrieval performance much.
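
To make the "mostly independent" point concrete, here is a minimal, framework-free sketch of how a contrastive retrieval loss and a captioning loss could be combined as separate terms. It is not the repository's training code. It assumes a symmetric InfoNCE-style retrieval objective over matched (text, image) embedding pairs, and the names `contrastive_loss`, `total_loss`, `temperature`, and `caption_weight` are hypothetical, chosen only for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where text_emb[k] matches image_emb[k].

    Assumption: this is a generic contrastive objective, not necessarily
    the exact formulation used in the repository.
    """
    t = [normalize(v) for v in text_emb]
    i = [normalize(v) for v in image_emb]
    n = len(t)
    # sims[r][c] = scaled cosine similarity between text r and image c.
    sims = [[dot(a, b) / temperature for b in i] for a in t]
    # Text-to-image: each row should put its mass on the diagonal entry.
    t2i = -sum(log_softmax(sims[k])[k] for k in range(n)) / n
    # Image-to-text: same, but over columns of the similarity matrix.
    cols = [[sims[r][c] for r in range(n)] for c in range(n)]
    i2t = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (t2i + i2t)


def total_loss(retrieval_loss, captioning_loss, caption_weight=1.0):
    """Sum of the two terms; caption_weight is a hypothetical knob.

    Setting caption_weight=0.0 leaves only the retrieval term.
    """
    return retrieval_loss + caption_weight * captioning_loss
```

Because the two terms are simply added in this sketch, dropping the captioning term (`caption_weight=0.0`) leaves the retrieval term itself unchanged. Whether retrieval quality changes in practice is an empirical question, which is what the ablation in the paper addresses.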