Multimodal Pre-trained Framework for Aligning Image-Text Relation Semantics

Environment

Please refer to this repository for the text-image relationship dataset and to this repository for the Twitter100k dataset.

Download the pretrained BERTweet-base model from here and put it in this directory.

Download the pretrained ViT from here, rename the binary file to "resnet101.pth", and put it in this directory.

Download the pretrained Twitter word embeddings from here and put them in this directory.

Download our pretrained models from here and put them in this directory.
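
Before running the scripts, it may help to confirm that the downloaded files actually sit in this directory. The following is a minimal sketch, not part of the repository: the helper `missing_files` is hypothetical, and "resnet101.pth" is the only file name the steps above specify, so the names of the other downloads have to be added by hand.

```python
from pathlib import Path


def missing_files(directory, required):
    """Return the names in `required` that are not regular files in `directory`."""
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Only "resnet101.pth" is named in the setup steps; append the file names
    # of the other downloads (assumption: they are placed in this directory).
    required = ["resnet101.pth"]
    missing = missing_files(".", required)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All listed files are present.")
```

Run it from the repository root; any name it reports as missing should be downloaded or renamed before starting pre-training.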

python pretrain.py --cuda [GPU ID] --encoder [encoder name] --task_ids [task IDs] (--ocr)

python linear_probe.py --cuda [GPU ID] --encoder [encoder name] --task_ids [task IDs] (--ocr)

python finetune.py --cuda [GPU ID] --encoder [encoder name] --task_ids [task IDs] (--ocr)

python statistic.py --eval [evaluation setting, e.g. fine-tune] --encoder [encoder name]

python gradcam.py --encoder [encoder name]