kohjingyu/fromage

Concatenating two captions in retrieval mode

jeasinema opened this issue · 6 comments

In the paper, concatenating two captions in retrieval mode only negatively affects performance, so maybe it should just be removed? Also, I couldn't find all_last_embedding_idx anywhere else in this file.

first_last_embedding_idx, second_last_embedding_idx = all_last_embedding_idx[i]

Thanks for letting me know. This was part of a piece of code that was supposed to be removed in the final version, but I missed it. It's been removed now. Thanks!

Thank you! Just one quick follow-up: in Fig. 2 of the paper, FROMAGe also seems to have a "captioning" loss for autoregressive text generation in retrieval mode. I'm wondering whether this is actually useful, since the LM is basically frozen and the learnable [RET] token is at the very end.

The captioning loss (or rather, the next-token prediction loss, since there is no image input) is useful because it trains the model to produce [RET]. Without it, the model would never produce [RET] at inference time with greedy/nucleus sampling, because it is a new token that the pretrained LLM has never seen.
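To make that concrete, here is a minimal sketch of the idea (this is not the actual FROMAGe training code; the stand-in model name and the choice to leave the whole embedding matrix trainable are simplifying assumptions). The point is that a standard next-token prediction loss over captions ending in [RET] lets gradients reach the new token's embedding while the rest of the LM stays frozen:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in LM for illustration; FROMAGe uses a larger frozen OPT model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
tokenizer.add_tokens(["[RET]"])  # [RET] is a brand-new token
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.resize_token_embeddings(len(tokenizer))

# Freeze all pretrained weights, then re-enable gradients for the embedding
# matrix so the new [RET] row can be learned. (A simplification: FROMAGe
# keeps the LM frozen and trains only the new token embedding plus its
# added linear mappings.)
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

caption = "Here are some photos of a sparrow [RET]"
batch = tokenizer(caption, return_tensors="pt")

# Standard causal-LM (next-token prediction) loss over the caption, [RET]
# included, which is what teaches the model that captions end with [RET].
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()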

Got it. So the point here is to encourage the model to proactively retrieve images (as in the demo in Fig. 1)? I'm quite interested in how you create such training data. I'm guessing it's something like:

Show me some pictures of a sparrow. Here are some photos of a sparrow [RET].

Correct. This is described in detail in Sec. 3.2 of the paper, but basically we just append [RET] to the end of every caption (there are probably better ways to do this):

caption += '[RET]'
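For instance, a slightly more careful version of that preprocessing step might look like the following (just a sketch, the helper name is illustrative and not something in the repo):

RET_TOKEN = '[RET]'

def append_ret(caption):
    # Append the retrieval token, avoiding duplicates and trailing whitespace.
    caption = caption.strip()
    if not caption.endswith(RET_TOKEN):
        caption = caption + RET_TOKEN
    return caption

captions = ['A sparrow perched on a branch.', 'Two dogs playing in the snow.']
captions = [append_ret(c) for c in captions]
# -> ['A sparrow perched on a branch.[RET]', 'Two dogs playing in the snow.[RET]']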

Thanks for clarifying. Yes, this does force the model to emit [RET] at the end every time, but at this point that seems quite reasonable.