HLTCHKUST/VG-GPLMs

`image_len` is not used?

Opened this issue · 5 comments

image_len=None,

image_len is not used when calculating attn?

image_len=None means the default value is None; you can pass a list of ints (one per batch element) to this function.

What I mean is that image_len is not used as a mask when calculating attn.

And is there an error in the attn softmax dim?

attn = F.softmax(attn, dim=1)

attn has shape [batch_size (0), text_len (1), image_len (2)], so the softmax should be over the image_len dim (2).
So I think the softmax dim should be 2, not 1?

Is there something wrong with my thinking?
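To illustrate the argument above, here is a toy sketch (not the repo's code) showing what each softmax dim normalizes over for a score tensor of shape [batch, text_len, image_len]:

```python
import torch
import torch.nn.functional as F

# Toy cross-attention scores: [batch, text_len, image_len]
batch, text_len, image_len = 2, 3, 4
attn = torch.randn(batch, text_len, image_len)

# dim=2 normalizes over image positions: for each text token,
# the attention weights over image features sum to 1 (the usual
# convention when text attends to image features).
w2 = F.softmax(attn, dim=2)

# dim=1 instead normalizes over text tokens for each image position,
# so each *column* of image positions sums to 1 across text tokens.
w1 = F.softmax(attn, dim=1)
```

With dim=2, `w2.sum(dim=2)` is all ones (one distribution per text token); with dim=1, it is `w1.sum(dim=1)` that is all ones instead.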

I see. The image_len is not used in the multimodal fusion function. You can put this as a mask in the cross-attention. Probably it can improve the performance slightly.
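A minimal sketch of what that mask could look like, assuming `image_len` is a per-example list of valid image-feature counts (the function name and shapes here are hypothetical, not the repo's API):

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(attn, image_len):
    """Zero out padded image positions before the softmax.

    attn: raw scores, shape [batch, text_len, max_image_len].
    image_len: list of valid image-feature counts per example.
    (Hypothetical sketch; names/shapes may differ from VG-GPLMs.)
    """
    batch, text_len, max_len = attn.shape
    lengths = torch.tensor(image_len).unsqueeze(1)   # [batch, 1]
    positions = torch.arange(max_len).unsqueeze(0)   # [1, max_len]
    pad_mask = positions >= lengths                  # True where padded
    # Broadcast the mask over the text dimension and kill padded scores.
    attn = attn.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    # Normalize over image positions (dim=2), per the discussion above.
    return F.softmax(attn, dim=2)

weights = masked_cross_attention(torch.randn(2, 3, 5), [5, 3])
# The second example has only 3 valid image features, so positions
# 3 and 4 receive exactly zero attention weight.
```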

attn = F.softmax(attn, dim=1)

I think L882 should be (for the reason above):
attn = F.softmax(attn, dim=2)

Am I wrong?