This repository contains the dataset accompanying the article Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain. The data can be found here.
The data are already split into train/dev/test sets.
The table below summarizes each task and its data location; dataset statistics are given in Table 2 of our paper.
Task | Benchmarked Results | Data Location
---|---|---
Fake news detection | Table 3 in paper | fakenews_article
LLM-generated fake sentence detection | Table 3 in paper | sentence
Multimodal fake news detection | Table 3 in paper | image_article
Fake tweet detection | Table 4 in paper | fakenews_tweet
Multimodal tweet detection | Table 4 in paper | image_tweet
For multimodal tasks, the paths to the images are stored in the `image` column. For news, a path looks like /images/2023-05-09_fakenews/LeadStories/551_32.png. As long as the images folder sits in the root directory of your project, you do not need to modify these paths.
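As a minimal sketch of how these paths can be resolved (assuming the splits are CSV files readable with pandas; the file name train.csv is hypothetical and should be replaced with the actual split file):

```python
import os
import pandas as pd

# Hypothetical split file name; adjust to the actual file under image_article.
df = pd.read_csv("image_article/train.csv")

# The directory that contains the images/ folder (the project root).
project_root = os.path.abspath(".")

# Paths in the dataset start with /images/..., so strip the leading slash
# and join them relative to the project root.
df["image_abspath"] = df["image"].apply(
    lambda p: os.path.join(project_root, p.lstrip("/"))
)

print(df["image_abspath"].head())
```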
The content and images of tweets can be collected with collect_by_tweetid_tweepy_clean.py, or with any other Twitter extraction tool that complies with Twitter's terms of service, given the tweet IDs.
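The provided script is the reference; purely as an illustration, a minimal Tweepy (v2 client) sketch for fetching tweet text and attached media by ID could look like the following. The bearer token and tweet IDs are placeholders, and valid Twitter API access is required.

```python
import tweepy

# Placeholder credentials and IDs -- replace with your own.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
tweet_ids = ["1234567890123456789"]

response = client.get_tweets(
    ids=tweet_ids,
    tweet_fields=["created_at", "text"],
    expansions=["attachments.media_keys"],
    media_fields=["url", "type"],
)

# Media objects (e.g., image URLs) are returned under includes.
media_by_key = {m.media_key: m for m in (response.includes or {}).get("media", [])}

for tweet in response.data or []:
    print(tweet.id, tweet.text)
```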
To set up the environment for the baselines, run:
conda env create -f clip_env.yaml
conda activate clip_env
Most of our baselines are built on Hugging Face models, so you need to provide a model name to run the code. The Hugging Face models used in our baseline experiments are listed below, followed by a short loading sketch.
Model Name | Hugging Face Name |
---|---|
BERT | bert-base-cased |
BioBERT | pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb |
Funnel Transformer | funnel-transformer/medium-base |
FN-BERT | ungjus/Fake_News_BERT_Classifier |
SentenceBERT | sentence-transformers/all-MiniLM-L6-v2 |
DistilBERT | sentence-transformers/msmarco-distilbert-base-tas-b |
CLIP | openai/clip-vit-base-patch32 |
VisualBERT | uclanlp/visualbert-vqa-coco-pre |
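These are standard Hugging Face model identifiers. As an illustrative sketch (not part of the repository's code), any of the text encoders in the table can be loaded with the transformers auto classes, e.g.:

```python
from transformers import AutoTokenizer, AutoModel

# Any text-model name from the table above can be substituted here.
model_name = "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode an example sentence and inspect the hidden states.
inputs = tokenizer("An example medical claim to encode.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```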
Below are some examples of training and testing the Hugging Face models. Please refer to the code for the full list of configurable arguments.
To fine-tune BioBERT, the command looks like this:
python fake_news_detection_main.py \
-bert-type pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb \
-device 0 \
-batch-size 4 \
-benchmark-path path/to/your/data \
-dataset-type fakenews_article
To test an existing model, the command is:
python fake_news_detection_main.py \
-bert-type pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb \
-device 0 \
-batch-size 4 \
-benchmark-path path/to/your/data \
-dataset-type fakenews_article \
-snapshot path/to/your/model \
-test
Similarly, to train and test a multimodal model, the commands are:
python fake_news_detection_multimodal_main.py \
-clip-type uclanlp/visualbert-vqa-coco-pre \
-device 0 \
-batch-size 4 \
-benchmark-path path/to/your/data \
-dataset-type image_article
and
python fake_news_detection_multimodal_main.py \
-clip-type uclanlp/visualbert-vqa-coco-pre \
-device 0 \
-batch-size 4 \
-benchmark-path path/to/your/data \
-dataset-type image_article \
-snapshot path/to/your/model \
-test
If you find the dataset helpful, please cite:
@article{sun2023med,
  title={Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain},
  author={Sun, Yanshen and He, Jianfeng and Lei, Shuo and Cui, Limeng and Lu, Chang-Tien},
  journal={arXiv preprint arXiv:2306.08871},
  year={2023}
}
or
Sun, Yanshen, et al. "Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain." arXiv preprint arXiv:2306.08871 (2023).