e-bug/volta

On the visual token added to linguistic tokens in VLBertEmbeddings class

iki-taichi opened this issue · 2 comments

Hello.

I have a question about the VLBertEmbeddings class.

In its forward function, a global image feature is added to the linguistic tokens.
The last token of the vision sequence is used as the global image feature, as below:

text_visual_embeddings = final_feats[:, -1].repeat(1, seq_length).view(batch_size, seq_length, -1)

Using the last token seems reasonable for the original VL-BERT (vl-bert_base.json), because its add_global_imgfeat is "last",
but I think it should be the first token for the controlled VL-BERT (ctrl_vl-bert_base.json), whose add_global_imgfeat is "first".
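To make the mismatch concrete, here is a framework-free sketch of the indexing logic being discussed. The helper name and the string values "first"/"last" for add_global_imgfeat are assumptions based on the config files mentioned above, not the actual volta API:

```python
# Hypothetical helper illustrating the proposed fix:
# pick the position of the global image token in the vision
# sequence based on where it was added.

def global_imgfeat_index(add_global_imgfeat: str) -> int:
    """Return the index of the global image feature in the vision sequence."""
    # "first": the global feature was prepended, so it sits at index 0;
    # "last": it was appended, so it sits at the final index (-1).
    return 0 if add_global_imgfeat == "first" else -1

# The forward pass could then select the correct token, e.g.:
#   idx = global_imgfeat_index(config.add_global_imgfeat)
#   text_visual_embeddings = (
#       final_feats[:, idx].repeat(1, seq_length).view(batch_size, seq_length, -1)
#   )
```

With "last" this reduces to the current behavior (`final_feats[:, -1]`); with "first" it selects index 0 instead.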

Is there any reason why the last token is always used in this class?

I'm sorry if I've misunderstood how the embeddings classes work.

Thanks.

e-bug commented

Hi Iki-san,

Thanks for pointing this out, you are right.

I don't think this impacts performance much, but I'll try to fix it.
However, I can't simply fix the code, as that would affect the controlled VL-BERT model that we released.

So, I'll need to find some time and resources to pre-train the controlled VL-BERT again.

I'll keep this issue open until I do so.

If you are pre-training VL-BERT, go ahead and fix the indexing problem :)

iki-taichi commented

Thank you for your kind answer.
I agree with you.
Although I'm curious about its impact, given the cost of pre-training, I don't think the fix is urgent.

As for me, I'm not able to do the pre-training due to a lack of resources :_(