weight w_i^j in Token-wise Cross-modal Alignment

Hello! In your paper, there is a weight in the Token-wise Cross-modal Alignment objective, but I cannot find the corresponding implementation in this repo, which confused me a lot. Thanks!

Hello,

The corresponding weights are implemented in:

MGCA/mgca/models/mgca/mgca_module.py

Line 153 in b9ec84f

word_atten_weights /= word_atten_weights.sum(dim=1, keepdims=True)

and

MGCA/mgca/models/mgca/mgca_module.py

Line 204 in b9ec84f

patch_atten_weights /= patch_atten_weights.sum(

Thanks!
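For readers following along, the effect of the two referenced normalization lines can be illustrated with a minimal sketch. It assumes (this is not confirmed in the thread) that the normalized weights are then used to take a weighted average of per-token loss terms. The function names `normalize_rows` and `weighted_token_loss` and all numbers are hypothetical, and plain Python lists stand in for the tensors in mgca_module.py.

```python
def normalize_rows(weights):
    """Scale each row so it sums to 1, mirroring x / x.sum(dim=1, keepdim=True)."""
    normalized = []
    for row in weights:
        total = sum(row)  # assumes a non-zero row sum
        normalized.append([w / total for w in row])
    return normalized


def weighted_token_loss(token_losses, weights):
    """Weight per-token losses within each sample, then average over samples.

    This usage is an assumption for illustration, not taken from the repo.
    """
    per_sample = [
        sum(w * l for w, l in zip(w_row, l_row))
        for w_row, l_row in zip(weights, token_losses)
    ]
    return sum(per_sample) / len(per_sample)


# Hypothetical attention scores for 2 samples with 3 tokens each.
raw_weights = [[1.0, 1.0, 2.0], [0.5, 0.5, 1.0]]
word_atten_weights = normalize_rows(raw_weights)

# Hypothetical per-token loss values with the same shape.
token_losses = [[0.4, 0.8, 0.2], [1.0, 1.0, 1.0]]
loss = weighted_token_loss(token_losses, word_atten_weights)
```

After normalization each row of weights sums to 1, so a token with a larger attention score contributes proportionally more to its sample's loss.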

Thanks a lot! Another question is about the COVIDx dataset. You said you "used the version 6 of COVIDx dataset", but I cannot find or download that version. I tried using "COVIDx CXR-2" together with "train_COVIDx9A.txt", but some images listed in train_COVIDx9A.txt are missing from the CXR-2 dataset, so I cannot reproduce the results. Could you give more specific details or links for this dataset? Thanks!

Sorry, one more question: did you pretrain MGCA twice (once with a ViT image backbone and once with ResNet) and use only the ResNet one for the detection and segmentation tasks? Can the ViT one also be used for detection and segmentation, or must those tasks use the ResNet one? Thanks!

Thanks a lot! Another question is about the COVIDx dataset. You said you "used the version 6 of COVIDx dataset", but I cannot find or download that version. I tried using "COVIDx CXR-2" together with "train_COVIDx9A.txt", but some images listed in train_COVIDx9A.txt are missing from the CXR-2 dataset, so I cannot reproduce the results. Could you give more specific details or links for this dataset? Thanks!

I think you downloaded the latest version (version 7) from the Kaggle page (the version label is shown on the right). That version is associated with the updated COVIDx9A, and some images were removed from it, as noted there. You should download version 6 from the Kaggle page (it shows as "Metadata Updated"), in which those images are not removed. Hope it helps.
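To confirm which version was downloaded, one option is to list the images that the split file names but the image folder lacks. The sketch below assumes each line of train_COVIDx9A.txt is whitespace-separated with the filename in the second field. That layout, the helper name `find_missing_images`, and the `filename_column` parameter are assumptions, so adjust them to the actual file.

```python
from pathlib import Path


def find_missing_images(split_file, image_dir, filename_column=1):
    """Return filenames listed in split_file that do not exist in image_dir."""
    image_dir = Path(image_dir)
    missing = []
    with open(split_file, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) <= filename_column:
                continue  # skip blank or malformed lines
            name = fields[filename_column]
            if not (image_dir / name).is_file():
                missing.append(name)
    return missing
```

Calling `find_missing_images("train_COVIDx9A.txt", "train")`, where `"train"` is a placeholder for the local image folder, should return an empty list if the downloaded version matches the split file.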

Sorry, one more question: did you pretrain MGCA twice (once with a ViT image backbone and once with ResNet) and use only the ResNet one for the detection and segmentation tasks? Can the ViT one also be used for detection and segmentation, or must those tasks use the ResNet one? Thanks!

Yes! Using the ViT one might also be possible, but at that time we thought that CNN-based detection and segmentation approaches perform better than ViT-based methods in the chest X-ray domain.