This is the repo for "MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets" accepted at Findings of EMNLP '21.
Setting up dependencies
if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"
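The snippet above expects `CUDA_version` to be set already. One possible way to obtain it is sketched below, assuming `nvcc` is on the PATH and prints a line containing `release <major>.<minor>`. The function names `parse_cuda_release`, `torch_suffix`, and `detect_cuda_version` are illustrative and not part of this repo.

```python
import re
import subprocess
from typing import Optional

# Mapping copied from the if/elif chain above; every other value,
# including "not detected", falls back to "+cu110" like the else branch.
SUFFIX_BY_RELEASE = {"10.0": "+cu100", "10.1": "+cu101", "10.2": ""}


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Return the 'major.minor' release found in `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def torch_suffix(cuda_version: Optional[str]) -> str:
    """Map a CUDA release string to the torch wheel suffix used above."""
    return SUFFIX_BY_RELEASE.get(cuda_version, "+cu110")


def detect_cuda_version() -> Optional[str]:
    """Run `nvcc --version` and parse its output; None if nvcc is unavailable."""
    try:
        out = subprocess.check_output(["nvcc", "--version"]).decode("utf-8")
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_cuda_release(out)


if __name__ == "__main__":
    CUDA_version = detect_cuda_version()
    torch_version_suffix = torch_suffix(CUDA_version)
    print("CUDA version:", CUDA_version, "suffix:", repr(torch_version_suffix))
```

If `nvcc` is missing, this sketch silently picks `+cu110`, mirroring the `else` branch; you may prefer to set `CUDA_version` by hand in that case.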
For installing CLIP
! pip3 install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html ftfy regex --user
! wget https://openaipublic.azureedge.net/clip/bpe_simple_vocab_16e6.txt.gz -O bpe_simple_vocab_16e6.txt.gz
For sentence transformers: follow the installation steps at https://github.com/UKPLab/sentence-transformers
The .py file contains the full set of steps, which must be run in sequence.
- It contains code for loading pre-saved ROI and entity features, which can be loaded if available.
- Otherwise the code for extracting features on-demand is also included.
- For initializing dataset and data loader for pytorch: Load the data-set for training and testing as per the requirement of the run.
- Experimental settings:
Choose the configuration for the binary or multi-class setting (training, testing, evaluation) that your run requires; the corresponding code blocks are provided and commented out where appropriate.
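The "load pre-saved features if available, otherwise extract on demand" step can be sketched as a small caching helper. This is only an illustration: the helper name `load_or_extract`, the use of `pickle` as the storage format, and the example path are assumptions, and the actual script may store ROI and entity features differently.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_extract(cache_path: str, extract_fn: Callable[[], Any]) -> Any:
    """Return cached features if the file exists; otherwise extract and cache them.

    cache_path: where the features are (or will be) stored; hypothetical path.
    extract_fn: zero-argument callable that computes the features on demand.
    """
    path = Path(cache_path)
    if path.exists():
        # Pre-saved features are available: load them and skip extraction.
        with path.open("rb") as fh:
            return pickle.load(fh)

    # No cache yet: run the (possibly slow) extraction, then save the result.
    features = extract_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(features, fh)
    return features
```

A call might look like `roi_feats = load_or_extract("features/roi.pkl", extract_roi_features)`, where `extract_roi_features` stands in for whatever extraction routine you use. Only load pickle files you created yourself or obtained from a trusted source.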
Please note: TWO versions of the "Harmfulness" data are provided as part of this repo: HarMeme-V0 (has duplicates in Harm-P) and HarMeme-V1 (the complete, deduplicated set for Harm-P). We recommend HarMeme-V1 as the updated and correct version of the "Harmfulness" data for the US Politics category (Harm-P). Both V0 and V1 contain the original, ready-to-use data for Harm-C (the Covid-19 category). The "Target" data for both categories is available through the HarMeme-V0 link given below.
- HarMeme Images
- HarMeme-V0: CAUTION! OBSOLETE FOR HARM-P "Harmfulness" - Contains duplicates in Harm-P. See the upgraded version (V1) below for the deduplicated version of Harm-P (Harmfulness) data. HarMeme-V0 content (including Target data) can be accessed via the following links:
- HarMeme-V0 data files (Harmfulness + Target) - Contains duplicates for US Politics (Harmfulness)
- Entity features, ROI features, ROI + Entity features
- HarMeme-V1: Updated + Complete Version (for "Harmfulness"). For additional details about HarMeme-V1, refer to the README in the "HarMeme_V1" folder of this repo. Contents of "HarMeme_V1":
- Annotations (Same format as V0: [id, image, labels, text]) - Duplicates Removed.
- Meta-info (Collected using GCV API): Meme id, OCR Text, Web Entities, Best labels, Titles, Objects, ROI Info.
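As a rough illustration of working with the annotation format [id, image, labels, text], the sketch below parses records and flags repeated ids. It assumes the annotations are stored as one JSON object per line, which is not stated above, and the helper names `parse_annotations` and `duplicate_ids` are hypothetical. Checking for repeated `id` values is only a simple sanity check and is not necessarily how the duplicates in V0 were identified.

```python
import json
from typing import Dict, Iterable, List

# Field names taken from the annotation format listed above.
REQUIRED_FIELDS = ("id", "image", "labels", "text")


def parse_annotations(lines: Iterable[str]) -> List[Dict]:
    """Parse JSON-lines annotation records and check the four expected fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


def duplicate_ids(records: List[Dict]) -> List[str]:
    """Return ids that occur more than once, in order of repeated appearance."""
    seen, dupes = set(), []
    for record in records:
        if record["id"] in seen:
            dupes.append(record["id"])
        seen.add(record["id"])
    return dupes


if __name__ == "__main__":
    # Made-up placeholder records, not real dataset entries.
    sample = [
        '{"id": "m1", "image": "m1.png", "labels": ["label_a"], "text": "first"}',
        '{"id": "m2", "image": "m2.png", "labels": ["label_b"], "text": "second"}',
        '{"id": "m1", "image": "m1.png", "labels": ["label_a"], "text": "first"}',
    ]
    print(duplicate_ids(parse_annotations(sample)))
```

To run it on a real file, pass an open file handle, for example `parse_annotations(open(path, encoding="utf-8"))`, with `path` pointing at whichever annotation file you downloaded.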
Acknowledgement: Thanks to mingshanhee and uprihtness for pointing out the discrepancies.