ITR

Multimodal Pre-trained Framework for Aligning Image-Text Relation Semantics

Environment

  • python
  • numpy
  • torch
  • torchvision
  • transformers

Install the dependencies with: pip install -r requirements.txt
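
As a quick sanity check after installation, the following minimal sketch reports which of the packages listed above cannot be imported. The helper name `missing_packages` is hypothetical and not part of this repository; it only uses the standard library.

```python
import importlib.util

# Packages listed in the Environment section of this README.
REQUIRED = ["numpy", "torch", "torchvision", "transformers"]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

This only checks that the packages can be found; it does not verify versions, which are pinned (if at all) in requirements.txt.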

Data

Please refer to this repository for the text-image relationship dataset and to this repository for the Twitter100k dataset.

Pretrained Models/Embeddings

Download the pretrained BERTweet-base model from here and put it in this directory.

Download the pretrained ViT from here, rename the binary file to "resnet101.pth", and put it in this directory.

Download the pretrained Twitter word embeddings from here and put them in this directory.

Download our pretrained models from here and put them in this directory.
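
Before running any script, it may help to confirm that the downloaded files are where they are expected. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and "resnet101.pth" is the only file name this README states, so any other names have to be filled in by hand.

```python
from pathlib import Path


def missing_files(directory, filenames):
    """Return the file names that do not exist as regular files in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


if __name__ == "__main__":
    # Only "resnet101.pth" is named in this README; extend the list as needed.
    expected = ["resnet101.pth"]
    missing = missing_files(".", expected)
    if missing:
        print("Not found in this directory:", ", ".join(missing))
    else:
        print("All expected files are present.")
```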

Usage

Pretrain

python pretrain.py --cuda [GPU ID] --encoder [encoder name] --task_ids [task IDs] (--ocr)

Linear probe

python linear_probe.py --cuda [GPU ID] --encoder [encoder name] --task_ids [task IDs] (--ocr)

Finetune

python finetune.py --cuda [GPU ID] --encoder [encoder name] --task_ids [task IDs] (--ocr)
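
The three commands above share the same flags, so they can be assembled programmatically when running several configurations. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the values passed for the encoder name and task IDs are placeholders, since their accepted formats are defined by the scripts themselves.

```python
import shlex


def build_command(script, cuda, encoder, task_ids, ocr=False):
    """Assemble the argument list for pretrain.py, linear_probe.py, or finetune.py."""
    cmd = [
        "python", script,
        "--cuda", str(cuda),
        "--encoder", str(encoder),
        "--task_ids", str(task_ids),
    ]
    if ocr:
        # --ocr is the optional flag shown in parentheses in the usage lines.
        cmd.append("--ocr")
    return cmd


if __name__ == "__main__":
    # Placeholder values; replace them with a real encoder name and task IDs.
    for script in ("pretrain.py", "linear_probe.py", "finetune.py"):
        cmd = build_command(script, 0, "ENCODER_NAME", "TASK_IDS", ocr=True)
        print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to launch a run instead of printing it.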

Result Analysis

Result statistics

python statistic.py --eval [evaluation setting, e.g. fine-tune] --encoder [encoder name]

Grad-CAM visualization

python gradcam.py --encoder [encoder name]