MILVLG/mcan-vqa

linear fusion model

clytze0216 opened this issue · 1 comment

Thank you for sharing. I would like to ask whether you have tried replacing the linear multimodal fusion model, and whether doing so affects the accuracy. Looking forward to your reply. Thanks a lot!

We have tested some other fusion models, such as element-wise product (eltwise-prod), concatenation, and bilinear pooling. However, none of these brought any improvement. We think the reason may be that multimodal fusion is already performed implicitly in the deep co-attention learning stage.

Since the modification is minor, you can try it yourself. If you find any new results, we would appreciate it if you could let us know.
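
For anyone who wants to try the swap, here is a minimal NumPy sketch of the four fusion variants mentioned in this thread. It is not the repository's code. It assumes that "linear fusion" means summing two linearly projected feature vectors, and it uses a plain full bilinear form for "bilinear pooling", since the reply does not say which bilinear variant was tested. All function names, shapes, and weights are hypothetical, and biases and layer normalization are left out.

```python
import numpy as np


def linear_fusion(x, y, w_x, w_y):
    """Assumed baseline: sum of the two linearly projected features."""
    return x @ w_x + y @ w_y


def eltwise_prod_fusion(x, y, w_x, w_y):
    """Element-wise product of the two projected features."""
    return (x @ w_x) * (y @ w_y)


def concat_fusion(x, y, w_cat):
    """Concatenate both features, then apply one projection."""
    return np.concatenate([x, y], axis=-1) @ w_cat


def bilinear_fusion(x, y, w_bi):
    """Full bilinear form: z[b, k] = sum_ij x[b, i] * w_bi[i, j, k] * y[b, j]."""
    return np.einsum("bi,ijk,bj->bk", x, w_bi, y)


# Hypothetical sizes, chosen only for illustration.
batch, d_x, d_y, d_z = 2, 4, 5, 3
rng = np.random.default_rng(0)

x = rng.standard_normal((batch, d_x))      # attended question feature
y = rng.standard_normal((batch, d_y))      # attended image feature
w_x = rng.standard_normal((d_x, d_z))
w_y = rng.standard_normal((d_y, d_z))
w_cat = np.vstack([w_x, w_y])              # shape (d_x + d_y, d_z)
w_bi = rng.standard_normal((d_x, d_y, d_z))

z_linear = linear_fusion(x, y, w_x, w_y)
z_prod = eltwise_prod_fusion(x, y, w_x, w_y)
z_concat = concat_fusion(x, y, w_cat)
z_bilinear = bilinear_fusion(x, y, w_bi)
```

In this sketch, concatenation followed by a single projection gives exactly the same output as the linear fusion when `w_cat` stacks `w_x` on top of `w_y`, so without biases the two variants express the same family of functions. Only the product and bilinear variants add multiplicative interactions between the two modalities.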