Fake news consists of articles that contain misleading information aimed at changing others' opinions and thereby gaining power (political, business, etc.). In this study, I propose a machine learning model based on Naive Bayes and implemented in PySpark for classifying documents into two groups of news: reliable and fake. Data cleaning, stop-word removal, and term-frequency counting were implemented to generate the training and test datasets. Results of the ML model were compared to the baseline using a confusion matrix, and showed a substantial improvement in accuracy and F1 score.
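The workflow outlined above can be sketched as a single PySpark pipeline. The stage order, the column names (`text`, `words`, `filtered`, `features`, `label`), and the 80/20 split below are illustrative assumptions, not the notebook's exact configuration:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("fake-news").getOrCreate()

# Assumed schema: a "text" column with the article body and a numeric
# "label" column (0 = reliable, 1 = fake).
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
tf = CountVectorizer(inputCol="filtered", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
nb = NaiveBayes(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, nb])

# Illustrative train/test split and fit:
# train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
# model = pipeline.fit(train_df)
# predictions = model.transform(test_df)
```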
- NLTK

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
```
- PySpark

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.ml.feature import IDF, Tokenizer, VectorAssembler
from pyspark.ml.feature import StopWordsRemover, CountVectorizer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import when, col, regexp_replace, concat, lit, length
from pyspark.sql.types import FloatType, DoubleType
from pyspark.ml.classification import NaiveBayesModel, NaiveBayes
from pyspark.mllib.evaluation import BinaryClassificationMetrics
```
- Others

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```
- `evaluate(df, labelCol = "label", predCol = "prediction")`

  Computes precision, accuracy, F1 score, and recall. Prints all four along with the confusion matrix, and returns a subset of them as a tuple: `(confusion_matrix, precision, recall)`.
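A minimal sketch of what such an `evaluate` helper could look like, assuming the DataFrame holds numeric `label` and `prediction` columns for a binary problem; the notebook's actual implementation may differ:

```python
from pyspark.sql.functions import col

def evaluate(df, labelCol="label", predCol="prediction"):
    # Confusion-matrix counts (assuming 1 = fake, 0 = reliable)
    tp = df.filter((col(labelCol) == 1) & (col(predCol) == 1)).count()
    tn = df.filter((col(labelCol) == 0) & (col(predCol) == 0)).count()
    fp = df.filter((col(labelCol) == 0) & (col(predCol) == 1)).count()
    fn = df.filter((col(labelCol) == 1) & (col(predCol) == 0)).count()

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    confusion_matrix = [[tn, fp], [fn, tp]]
    print("Confusion matrix:", confusion_matrix)
    print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  "
          f"Recall: {recall:.3f}  F1: {f1:.3f}")
    return confusion_matrix, precision, recall
```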
- `class Stemmer(Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable)`

  Converts every word in the tokenized list to its stem using an NLTK `PorterStemmer` instance, thereby reducing the dimensionality of the features column. For example, the words "Playing", "Plays", and "Played" are all reduced to the same stem, "play".
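A rough sketch of how such a custom Transformer could be implemented; the constructor, default column names, and `_transform` body are assumptions based on the description above rather than the notebook's exact code:

```python
from nltk.stem import PorterStemmer
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


class Stemmer(Transformer, HasInputCol, HasOutputCol,
              DefaultParamsReadable, DefaultParamsWritable):
    """Stems every token in the input array column with NLTK's PorterStemmer."""

    def __init__(self, inputCol="words", outputCol="stemmed"):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        stemmer = PorterStemmer()
        # Map each array of tokens to an array of stems.
        stem_udf = udf(lambda tokens: [stemmer.stem(t) for t in tokens],
                       ArrayType(StringType()))
        return dataset.withColumn(self.getOutputCol(),
                                  stem_udf(dataset[self.getInputCol()]))
```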
Running on Google Colaboratory: Upload the `Final Project.ipynb` file to Google Drive, launch it with Google Colaboratory, and run the code.
Any other platform:
Clone the repo using the following command in a terminal:

```
git clone https://github.com/avivfaraj/DSCI631-project.git
```

Upload `Final Project.ipynb` and the dataset to the platform of your choice. Before running the code, make sure to change the path to the dataset (see the sketch below).
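For reference, a minimal sketch of how the dataset path is typically passed to Spark when the notebook loads the data; the variable name and file location below are assumptions and should be adjusted to your setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fake-news").getOrCreate()

# Hypothetical location; point this at wherever you uploaded the dataset.
data_path = "/path/to/dataset.csv"
df = spark.read.csv(data_path, header=True, inferSchema=True)
```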
The dataset was found on Kaggle.