The FakeHealth repository supplements the paper "Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository". It was collected to address challenges in fake health news detection and includes news contents, news reviews, social engagements, and user networks.
The repository consists of two datasets: HealthStory and HealthRelease. Because Twitter's policy protects user privacy, the full contents of user social engagements and networks cannot be published directly. Instead, we store the IDs of all social engagements and the related user network in JSON files, and supplement them with an API for retrieving the social engagements and user network from Twitter. The IDs are stored in ./dataset/engagements/HealthRelease.json, ./dataset/engagements/HealthStory.json, ./dataset/user_network/followers/, and ./dataset/user_network/following/. Due to size limitations, the follower and following IDs are uploaded to Zenodo as version 2 of FakeHealth.
- twython==3.7.0
- A Twitter developer app to generate app_key, app_secret, oauth_token, and oauth_token_secret
- Set up the file .\API\resources\tweet_keys_file.txt in the following format:
      app_key,app_secret,oauth_token,oauth_token_secret
      XXXXXX,XXXXXXX,XXXXXXXXX,XXXXXXXXXXXXX
- Build HealthStory:
      python main.py news_type=HealthStory sav_dir=../dataset
- Build HealthRelease:
      python main.py news_type=HealthRelease sav_dir=../dataset
- Build user network:
  - Download dataset/user_network/followers and dataset/user_network/following from https://zenodo.org/record/3606756.
  - (Optional) Collect the profiles of the followers and followings and save them into dataset/user_network/user_profiles:
        python crawl_friends_profiles.py sav_dir=../dataset
    Note that the number of friends is extremely large. We recommend crawling the friends' profiles only if it is necessary.
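Before running the build steps, it can help to check that the keys file is well formed. The following is a minimal sketch, not part of the repository: it assumes the file holds a header line followed by one line of comma-separated values, as in the format shown above, and `read_tweet_keys` is a hypothetical helper name.

```python
import csv
from pathlib import Path

# Field names taken from the keys-file format described in this README.
EXPECTED_FIELDS = ["app_key", "app_secret", "oauth_token", "oauth_token_secret"]


def read_tweet_keys(path):
    """Parse a keys file (assumed: header line, then one values line) into a dict."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not count as a row.
        rows = [row for row in csv.reader(fh) if row]
    if len(rows) < 2:
        raise ValueError("expected a header line followed by a line of values")
    header = [name.strip() for name in rows[0]]
    values = [value.strip() for value in rows[1]]
    if header != EXPECTED_FIELDS:
        raise ValueError(f"unexpected header: {header}")
    if len(values) != len(header):
        raise ValueError("number of values does not match the header")
    return dict(zip(header, values))
```

The resulting dictionary has the four credentials by name, so they could be passed to `twython.Twython(app_key, app_secret, oauth_token, oauth_token_secret)` if you want to test the credentials yourself; how main.py reads the file internally is not shown here.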
The downloaded dataset will have the following folder structure:
- content
  - HealthStory
    - <news_id>.json: a list of news contents, which include URL, title, keywords, tags, image URL, author, and publishing date.
  - HealthRelease.json: ~
- reviews
  - HealthStory.json: a list of news reviews, which include rating, news source, description, summary of the review, ground-truth labels of the ten standard criteria, explanations of the criteria judgements, and image link.
  - HealthRelease.json: ~
- engagements
  - HealthStory
    - <news_id>
      - tweets
        - <ID>.json: the JSON file of the tweet object. The attributes of the tweet object are described in the Twitter API documentation.
        - ......
      - retweets
        - <ID>.json
        - ......
      - replies
        - <ID>.json
  - HealthRelease
    - ......
- user_network
  - user_profiles
    - <user_name>.json: the JSON file of the user profile object. The attributes of the user profile object are described in the Twitter API documentation.
    - ......
  - user_timelines
    - <user_name>.json: a list of tweet objects
    - ......
  - user_followers
    - <user_name>.json: a list of user follower IDs (up to 200 per user)
    - ......
  - user_following
    - <user_name>.json: a list of user following IDs (up to 5000 per user)
    - ......
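Once the dataset is built, the folder structure above can be traversed with the standard library. The sketch below is illustrative only and assumes the layout exactly as listed; `count_engagements` and `load_follower_ids` are hypothetical helper names, not functions shipped with the repository.

```python
import json
from pathlib import Path

# Sub-folders listed under engagements/<news_type>/<news_id>/ in this README.
ENGAGEMENT_KINDS = ("tweets", "retweets", "replies")


def count_engagements(dataset_dir, news_type):
    """Return {news_id: {kind: number of <ID>.json files}} for one dataset."""
    root = Path(dataset_dir) / "engagements" / news_type
    counts = {}
    if not root.is_dir():
        return counts  # dataset not built yet, or a different layout
    for news_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        per_kind = {}
        for kind in ENGAGEMENT_KINDS:
            kind_dir = news_dir / kind
            # A news item may lack some engagement kinds; count those as 0.
            per_kind[kind] = len(list(kind_dir.glob("*.json"))) if kind_dir.is_dir() else 0
        counts[news_dir.name] = per_kind
    return counts


def load_follower_ids(dataset_dir, user_name):
    """Load the list of follower IDs stored for one user."""
    path = Path(dataset_dir) / "user_network" / "user_followers" / f"{user_name}.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `count_engagements("../dataset", "HealthStory")` would give a per-story summary of how many tweets, retweets, and replies were collected, which is a quick way to confirm that a crawl finished as expected.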
If you use the FakeHealth datasets, please cite the following paper:
@article{dai2020ginger,
title={Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository},
author={Dai, Enyan and Sun, Yiwei and Wang, Suhang},
journal={arXiv preprint arXiv:2002.00837},
year={2020}
}