The internet term for this type of misleading fake news is “clickbait”: headlines that catch a reader’s attention to make them click through to the fake news. This type of fake news is misleading at best and untrue at worst. In this project, I extracted patterns from the headline text using natural language processing and performed exploratory data analysis, creating intuitive features that provide insight into fake or clickbait headlines.
This project includes the work detailed below:
- Researching exploratory data analysis and feature engineering
- Importing data for text-based features
- Cleaning data for text-based features
- Building machine learning models using text-based features
- Data visualization using text-based features
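As an illustration of what "text-based features" on a headline can look like, here is a minimal sketch. The specific features (word count, all-caps words, exclamation and question marks, digits) and the function name `headline_features` are assumptions chosen for the example, not necessarily the features used in the notebooks.

```python
def headline_features(headline):
    """Compute a few simple, hypothetical text-based features for one headline."""
    words = headline.split()
    return {
        # Overall length of the headline
        "num_chars": len(headline),
        "num_words": len(words),
        # Words written entirely in capitals (e.g. "VIDEO", "BREAKING")
        "num_all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # Attention-grabbing punctuation
        "num_exclaim_question": sum(headline.count(c) for c in "!?"),
        # Whether the headline contains any digit (e.g. "10 reasons ...")
        "has_digit": any(c.isdigit() for c in headline),
    }


# Made-up headline, for illustration only
features = headline_features("WATCH: You won't BELIEVE what happened next!!")
print(features)
```

Features like these can be computed for every headline and then compared between the fake and real groups during exploratory analysis.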
The dataset used in this project is the Fake News Dataset from Kaggle. It contains two types of articles: fake news and real news. The dataset was collected from real-world sources. The truthful articles were obtained by crawling Reuters.com (a news website). The fake news articles were collected from unreliable websites flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The articles cover a range of topics, but the majority focus on politics and world news.
The dataset consists of two CSV files. The first file, True.csv, contains more than 12,600 articles from Reuters.com. The second file, Fake.csv, contains more than 12,600 articles from various fake news outlets. Each article contains the following information:
- article title (news headline)
- text
- type (REAL or FAKE)
- the date the article was published
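Because the real and fake articles live in separate files, a common first step is to read both and attach the label to each row. The sketch below uses only the standard library and small in-memory stand-ins for the two files. The column names `title`, `text` and `date`, the helper `load_labeled`, and the sample rows are assumptions for illustration, so check them against the actual CSV headers.

```python
import csv
import io


def load_labeled(file_obj, label):
    """Read article rows from an open CSV file and tag each row with a label."""
    reader = csv.DictReader(file_obj)
    # Setting "type" here overrides any existing column of the same name
    return [{**row, "type": label} for row in reader]


# Tiny in-memory stand-ins for True.csv and Fake.csv (made-up rows)
true_csv = io.StringIO(
    "title,text,date\n"
    "Senate passes budget bill,Body text,2017-12-01\n"
)
fake_csv = io.StringIO(
    "title,text,date\n"
    "You won't believe this VIDEO,Body text,2017-12-02\n"
)

articles = load_labeled(true_csv, "REAL") + load_labeled(fake_csv, "FAKE")
for article in articles:
    print(article["type"], "|", article["title"])
```

With the real files, the same helper could be called as `load_labeled(open("True.csv", newline="", encoding="utf-8"), "REAL")`, assuming the files sit in the working directory.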
A word cloud shows the most frequently occurring words in the corpus. The size of each word is based on its frequency in the corpus: the more often a word appears, the larger its font. As the word clouds show, the most frequent words in fake news are Video, Obama, Hillary, Trump and Republican, whereas real news headlines feature Trump, White House, North Korea, China, etc.
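The frequencies that drive a word cloud can be sketched with a simple counter. This is a minimal illustration, not the notebook's code: the tokenizer, the stopword set, the helper `word_frequencies`, and the sample headlines are all assumptions.

```python
import re
from collections import Counter


def word_frequencies(headlines, stopwords=frozenset()):
    """Count how often each lowercase word appears across the headlines."""
    counts = Counter()
    for headline in headlines:
        # Very simple tokenizer: runs of letters and apostrophes
        tokens = re.findall(r"[a-z']+", headline.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Made-up headlines, for illustration only
headlines = [
    "Trump speaks on trade",
    "VIDEO: Trump responds to critics",
    "Obama video goes viral",
]
freqs = word_frequencies(headlines, stopwords={"on", "to"})
print(freqs.most_common(3))
```

A mapping like `freqs` is the kind of input that the `wordcloud` package accepts through `WordCloud().generate_from_frequencies(freqs)`, which scales each word's font size by its count.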
The following results were achieved in this project using different modeling approaches.
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Multinomial Naive Bayes TF-IDF (Bi-gram) | 94.10% | 91.95% | 97.17% | 94.48% |
Passive Aggressive Classifier TF-IDF (Bi-gram) | 95.9% | 95.53% | 96.73% | 96.12% |
Logistic Regression TF-IDF (Bi-gram) | 94.6% | 94.63% | 94.95% | 94.78% |
LSTM with GloVe embedding | 94.69% | 95.00% | 95.00% | 95.00% |
BERT (1 epoch) | 98.43% | 98.00% | 98.00% | 98.00% |
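For readers less familiar with the four metrics in the table, the sketch below shows how they are derived from the counts of a binary confusion matrix. The function names and the confusion counts are made up for illustration. Only the precision and recall values in the last step are taken from the table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts, not taken from the project
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print({name: round(value, 4) for name, value in metrics.items()})

# Precision and recall from the Multinomial Naive Bayes row of the table
nb_f1 = f1_from_precision_recall(0.9195, 0.9717)
print(f"{nb_f1:.2%}")
```

The F1 value recomputed from the reported precision and recall of the Naive Bayes row comes out at roughly 94.5%, in line with the table up to rounding.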
This project requires Python 3.7.x. The following Python libraries should be installed to get the project started:
I also recommend installing Anaconda, a pre-packaged Python distribution that contains all of the necessary libraries and software for this project, including Jupyter Notebook for running the IPython notebooks.
In a terminal or command window, navigate to the top-level project directory fake-news-classifier/ (that contains this README) and run one of the following commands:
ipython notebook Fake_news_preprocessing.ipynb
or
ipython notebook "fake news Analysis.ipynb"
or
ipython notebook "fake news headline LSTM.ipynb"
or
ipython notebook fake_news_classification_machine_learning_approach.ipynb
or
ipython notebook fake_news_classification_using_BERT.ipynb
This will open the Jupyter Notebook software and project file in your browser.