/Fake-News-Project-1

Fake News Project for CDIPS 2017

Identifying Fake News

Project Description

The increased use of the Internet and social media to share news has allowed information to travel at record speeds. However, it has also led to the rise of fake news stories, a recent phenomenon that relies on the ability of an article to go "viral" without being vetted by an editorial team, as happens in traditional news sources. This project seeks to identify fake or highly biased news articles to help prevent the spread of false information. More specifically, we implemented a program that examines the existence of authors, the word and punctuation usage in titles, and the article bodies, and we used machine learning algorithms to identify news from unreliable sources.

Building Dataset

Our final dataset (balanced_data.csv in the data directory) contains 1473 articles from reliable sources and 1473 articles from unreliable sources. Articles from reliable sources have the attribute authenticity set to 0, while articles from unreliable sources have authenticity set to 1.

1. Collect Data

Our data collection scripts use the Python library Newspaper, which collects article information from a wide variety of news sources:

  • Step1_CollectData_Hyesoo.ipynb (Hyesoo's script)
  • Step1_CollectData_Jinmei_FakeNews.ipynb and Step1_CollectData_Jinmei_RealNews.ipynb (Jinmei's scripts)

We have collected news from 12 reliable news sources and 42 unreliable news websites.

  • The reliable sources include msnbc, nbcnews, politico, foxnews, nytimes, reuters, abc, bbc, cnn, newyorker, cbsnews, and npr.
  • The unreliable news sources we used are 24wpn, beforeitsnews, readconservatives, newsbbc, now8news, americanfreepress, nephef, nationonenews, infostormer, Conservativedailypost, donaldtrumppotus45, ladylibertysnews, interestingdailynews, president45donaldtrump, openmagazines, krbcnews, bizstandardnews, bipartisanreport, local31news, nbcnews, CivicTribune, politicono, redcountry, AmericanFlavor, ddsnewstrend, Clashdaily, realnewsrightnow, wordpress, reagancoalition, lastdeplorables, Americannews, aurora-news, thedcgazette, politicalo, newswithviews, pamelageller, Bighairynews, ABCnews, sputniknews, prntly, Americanoverlook, and majorthoughts.
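As a rough illustration of the collection step, the sketch below shows how one article could be fetched with Newspaper and turned into a labeled row. It is not taken from the notebooks: the helper names `fetch_article` and `to_record`, the record fields, and the example URL are assumptions made for this example. Only the label convention (0 for reliable, 1 for unreliable) comes from the dataset description.

```python
from types import SimpleNamespace


def fetch_article(url):
    """Download and parse one article with Newspaper (needs network access)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article


def to_record(article, authenticity):
    """Flatten an article into a row; authenticity is 0 (reliable) or 1 (unreliable)."""
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        "text": article.text,
        "authenticity": authenticity,
    }


# Offline stand-in exposing the same attributes as a parsed Newspaper article.
sample = SimpleNamespace(
    url="https://example.com/story",
    title="Example headline",
    authors=["A. Writer"],
    text="Body of the article.",
)
record = to_record(sample, authenticity=0)
```

With network access, `to_record(fetch_article(url), authenticity=1)` would produce the same kind of row for an article from an unreliable source.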

2. Clean Data

Data collected with Step1_CollectData_Hyesoo.ipynb are cleaned with the script Step2_CleanData_Hyesoo.ipynb, which removes short articles and errors.
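A minimal sketch of this kind of filter is shown below. The 100-word threshold, the `is_usable` and `clean` helpers, and the choice to treat an empty title or body as an error are assumptions for illustration; the notebook's actual criteria may differ.

```python
MIN_WORDS = 100  # assumed threshold; the notebook's actual cutoff may differ


def is_usable(record, min_words=MIN_WORDS):
    """Keep a record only if it has a title and a sufficiently long body."""
    text = (record.get("text") or "").strip()
    title = (record.get("title") or "").strip()
    if not title or not text:
        return False  # treat empty fields as a failed download or parse
    return len(text.split()) >= min_words


def clean(records, min_words=MIN_WORDS):
    """Return only the records that pass the usability check."""
    return [r for r in records if is_usable(r, min_words)]


raw = [
    {"title": "Long enough", "text": "word " * 150},
    {"title": "Too short", "text": "only a few words"},
    {"title": "", "text": "word " * 150},  # missing title
    {"title": "No body", "text": None},    # failed parse
]
cleaned = clean(raw)
```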

3. Merge Data

The data collected by Hyesoo and Jinmei are merged with the script Step3_MergeData.ipynb, which also includes the cleaning procedure for the data collected by Jinmei.
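One simple way to combine two collections of article records is sketched below. Dropping duplicate URLs is an assumption added for illustration, as are the `merge_records` helper and the sample rows; the notebook may merge the data differently.

```python
def merge_records(*sources):
    """Concatenate record lists, keeping the first record seen for each URL."""
    seen = set()
    merged = []
    for records in sources:
        for record in records:
            url = record["url"]
            if url in seen:
                continue  # skip articles that were already collected
            seen.add(url)
            merged.append(record)
    return merged


hyesoo = [
    {"url": "https://example.com/a", "authenticity": 0},
    {"url": "https://example.com/b", "authenticity": 1},
]
jinmei = [
    {"url": "https://example.com/b", "authenticity": 1},  # duplicate of a row above
    {"url": "https://example.com/c", "authenticity": 1},
]
merged = merge_records(hyesoo, jinmei)
```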

Feature Engineering

Features include

  • existence of authors
  • exaggerated punctuation used in titles
  • rate of uppercase text used in titles
  • TF-IDF values generated with text body

The first three features were generated using the Step4_GenerateExtraFeatures.ipynb script. Afterwards, they were rescaled together with the TF-IDF features.
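The three title and author features could be computed along the following lines. The exact definitions are assumptions for this sketch: "exaggerated punctuation" is taken to mean a count of '!' and '?' characters, and the uppercase rate is taken per alphabetic character. The function names are hypothetical.

```python
import re


def has_author(authors):
    """1 if at least one non-empty author name is present, else 0."""
    return int(any(name.strip() for name in authors))


def exaggerated_punctuation_count(title):
    """Count '!' and '?' characters in a title (an assumed definition)."""
    return len(re.findall(r"[!?]", title))


def uppercase_rate(title):
    """Fraction of alphabetic characters in the title that are uppercase."""
    letters = [ch for ch in title if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


title = "SHOCKING news!!"
features = {
    "has_author": has_author([]),
    "punctuation": exaggerated_punctuation_count(title),
    "uppercase_rate": uppercase_rate(title),
}
```

For this title, 8 of the 12 letters are uppercase and there are two exclamation marks, so a headline written mostly in capitals stands out on both features.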

Predicting fake news with various machine learning classifiers

Models used include

  • Naive Bayes
  • Logistic Regression
  • Neural Network (MLPClassifier)
  • Random Forest
  • Support Vector Machine

See script Step5_ExtractFeatures_Predict_w_MachineLearning.ipynb for details.
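For orientation, the sketch below trains the five model families on TF-IDF features with scikit-learn. It is an illustration under assumptions, not the notebook's code: the toy corpus is invented, the specific classes chosen for Naive Bayes (`MultinomialNB`) and the support vector machine (`SVC` with a linear kernel) are guesses, and all hyperparameters are placeholders. The extra title and author features are left out to keep the example short.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus invented for illustration; 0 = reliable, 1 = unreliable.
texts = [
    "The senate passed the budget bill after a long debate.",
    "Officials confirmed the figures in a press briefing on Monday.",
    "The central bank left interest rates unchanged this quarter.",
    "SHOCKING secret they do not want you to know!!!",
    "You will not believe what this miracle cure does!!!",
    "Unbelievable scandal exposed, share before it is deleted!!!",
]
labels = [0, 0, 0, 1, 1, 1]

# Placeholder hyperparameters; each pipeline builds its own TF-IDF vocabulary.
classifiers = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "neural_network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=500, random_state=0
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": SVC(kernel="linear"),
}

new_article = ["Officials confirmed the budget figures."]
predictions = {}
for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    predictions[name] = model.predict(new_article).tolist()
```

On real data, the models would be compared on a held-out test split rather than on a single example.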

Results

We obtained a prediction accuracy above 92% on our dataset.

User Interface

We have developed a simple user interface that predicts the authenticity of an article of interest. Running the file run_final.py in the flask_api folder serves the interface at a local HTTP address, where the user can submit the URL of an article and the model predicts the article's authenticity.
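The general shape of such an interface can be sketched with Flask as follows. This is not the content of run_final.py: the `/predict` route, the `url` form field, and the `predict_authenticity` placeholder (which always returns 0 here) are assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_authenticity(url):
    """Hypothetical placeholder: a real version would download the article at
    `url`, build its features, and call the trained classifier."""
    return 0  # 0 = reliable, 1 = unreliable


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted article URL from the form data.
    url = request.form.get("url", "").strip()
    if not url:
        return jsonify(error="missing 'url' form field"), 400
    return jsonify(url=url, authenticity=predict_authenticity(url))
```

Calling `app.run()` would serve this app locally; by default Flask listens on http://127.0.0.1:5000.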

References

Lists of fake news websites: