This project aims to detect fake news
with Machine Learning and Deep Learning. In this project, I am using two datasets: one is from Huggingface, and the other is from Kaggle. I am concatenating these datasets to get a more comprehensive dataset for training and testing the models. I will be using the CountVectorizer
method to convert the text data to vectors for machine learning models. Then I will be using the Tokenizer
method to tokenize and pad the data for deep learning models. Finally, I will be comparing the accuracy of various machine learning and deep learning models to identify the best model for fake news detection.
Data
: The data used for my training was saved here.Model
: The model with the highest performance metrics was saved in this foldr.Notebook
: This folder contains the Jupyter notebooks used for training, testing, and analyzing the model.
I used two datasets to train and test the models.
-
NoahGift/fake-news dataset
: This dataset contains 2,096 articles. The data is in CSV format and had multiple columns but will be making use of only this four columns:title_without_stopwords
,text_without_stopwords
,type
, andlabel
. I have used this dataset as one of the sources for my training and testing data. -
Fake and Real News Dataset
: This dataset contains two CSV files, one with fake news and the other with real news. I have used both dataset for our training and testing data byconcantenating
. This dataset has four columns:title
,text
,subject
, anddate
.
I have done the following pre-processing steps before training and testing different models:
- Concatenated the two datasets to get a more comprehensive dataset.
- Removed the missing values from the dataset.
- Combined the title and text columns of the dataset.
- Converted the text data to lowercase.
- Tokenized the text data.
- Removed the stopwords from the text data.
- Lemmatized the tokens.
- Removed the punctuation marks.
I used both TFIDFVectorizer
and CountVectorizer
method to convert the text data to vecors for ML models. CountVectorizer
had the best performance metrics compared to TFIDFVectorizer
and I used the following machine learning models to train and test the dataset:
- Logistic Regression
- Random Forest
- Naive Bayes
- Gradient Boosting Classifier
- Decision Tree Classifier
- Extreme Gradient Boosting
After training and testing different models, I found that the Extreme Gradient Boosting Classifier
has the highest accuracy of 98.83%
.
Since I am working with text data, I had to tokenize and pad the data using the Tokenizer
and pad_sequences
method before building the deep learning model. Afterwards, I trained and tested the dataset using the following models:
- Bi-Directional LSTM
- Bi-Directional GRU (
Gated Recurrent Units (GRUs)
)
After training and testing the model, I found that the model with the highest accuracy of 99% was Bi-Directional LSTM
.
In conclusion, I have shown that both machine learning and deep learning models can be used to detect fake news. I found that the Extreme Gradient Boosting Classifier
has the highest accuracy among the machine learning models, while the Bi-Directional LSTM
has the highest accuracy among the deep learning models. These models can be used in the future to detect fake news and prevent the spread of false information. This is a Personalised Project