
Med-MMHL

This is the repository of the dataset corresponding to the article Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain. The data can be found here.

Dataset Description

The data are already split into train/dev/test sets.

The table below summarizes each task and its data location; the dataset statistics are given in Table 2 of our paper. A minimal loading sketch follows the table.

Task | Benchmarked Results | Data Location
--- | --- | ---
Fake news detection | Table 3 in paper | fakenews_article
LLM-generated fake sentence detection | Table 3 in paper | sentence
Multimodal fake news detection | Table 3 in paper | image_article
Fake tweet detection | Table 4 in paper | fakenews_tweet
Multimodal tweet detection | Table 4 in paper | image_tweet
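
For reference, here is a minimal loading sketch. It assumes each data location above is a subfolder of your benchmark path containing train/dev/test CSV files; the exact file names and layout are assumptions and may differ from your copy of the data.

import os
import pandas as pd

benchmark_path = "path/to/your/data"   # root of the downloaded dataset
dataset_type = "fakenews_article"      # one of the Data Location values above

# File names below are assumptions; adjust them to match the actual split files.
splits = {
    name: pd.read_csv(os.path.join(benchmark_path, dataset_type, f"{name}.csv"))
    for name in ("train", "dev", "test")
}
print({name: len(df) for name, df in splits.items()})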

For multimodal tasks, the relative paths to the images are stored in the column image; for news, a path looks like /images/2023-05-09_fakenews/LeadStories/551_32.png. You do not need to modify these paths as long as the images folder sits in the root directory of your project.
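
As an illustration, an image referenced in the image column can be opened as sketched below. The project_root variable and the split file name are placeholders, not part of the repository.

from pathlib import Path
from PIL import Image
import pandas as pd

project_root = Path(".")                               # directory containing the images/ folder
df = pd.read_csv("path/to/your/data/image_article/train.csv")  # hypothetical split file

# Paths in the `image` column are relative to the project root,
# e.g. /images/2023-05-09_fakenews/LeadStories/551_32.png
relative = df.loc[0, "image"].lstrip("/")
img = Image.open(project_root / relative)
print(img.size)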

The content and images of tweets can be crawled with the script collect_by_tweetid_tweepy_clean.py, or with any other Twitter extraction tool that complies with the platform's terms, given the tweet IDs.
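
If you prefer to write your own hydration code instead of using the provided script, a minimal Tweepy (v2 API) sketch looks roughly like this; the bearer token and tweet IDs are placeholders, and Twitter/X API access terms apply.

import tweepy

# Placeholder credentials and IDs -- substitute your own.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
tweet_ids = ["1234567890123456789"]   # IDs taken from the dataset

# get_tweets accepts up to 100 IDs per call.
response = client.get_tweets(
    ids=tweet_ids,
    tweet_fields=["text", "created_at"],
    expansions=["attachments.media_keys"],
    media_fields=["url"],
)
for tweet in response.data or []:
    print(tweet.id, tweet.text)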

Environment Configuration

conda env create -f clip_env.yaml

conda activate clip_env

Running Baselines

Most of our baselines are drawn from Hugging Face, so you need to provide the model name for the code to run. The Hugging Face models included in our baseline experiments are listed below, followed by a minimal loading sketch.

Model Name | Hugging Face Name
--- | ---
BERT | bert-base-cased
BioBERT | pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb
Funnel Transformer | funnel-transformer/medium-base
FN-BERT | ungjus/Fake_News_BERT_Classifier
SentenceBERT | sentence-transformers/all-MiniLM-L6-v2
DistilBERT | sentence-transformers/msmarco-distilbert-base-tas-b
CLIP | openai/clip-vit-base-patch32
VisualBERT | uclanlp/visualbert-vqa-coco-pre
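
For context, the text baselines can be instantiated from these checkpoints through the Transformers auto classes. The sketch below is independent of the repository's own training loop; num_labels=2 assumes binary fake/real classification.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is an assumption for binary fake/real classification.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("An example news sentence.", return_tensors="pt", truncation=True)
logits = model(**inputs).logits
print(logits.shape)   # (1, 2)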

Below are some examples of training and testing the Hugging Face models. Please refer to the code to explore more editable arguments.

To fine-tune BioBERT, the command looks like this:

python fake_news_detection_main.py \
    -bert-type pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type fakenews_article

To test an existing model, the command is:

python fake_news_detection_main.py \
    -bert-type pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type fakenews_article \
    -snapshot path/to/your/model \
    -test

Similarly, to train and test a multimodal model, the commands are:

python fake_news_detection_multimodal_main.py \
    -clip-type uclanlp/visualbert-vqa-coco-pre \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type image_article

and

python fake_news_detection_multimodal_main.py \
    -clip-type uclanlp/visualbert-vqa-coco-pre \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type image_article \
    -snapshot path/to/your/model \
    -test

If you find the dataset helpful, please cite

@article{sun2023med,
  title={Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain},
  author={Sun, Yanshen and He, Jianfeng and Lei, Shuo and Cui, Limeng and Lu, Chang-Tien},
  journal={arXiv preprint arXiv:2306.08871},
  year={2023}
}

or

Sun, Yanshen, et al. "Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain." arXiv preprint arXiv:2306.08871 (2023).