FakeNews-Detector

📑 Table of Contents

  1. Project Overview
  2. Objective
  3. Dataset
  4. Setup and Environment
  5. Jupyter Notebook Structure
  6. Methodology
  7. Experiments and Results
  8. Best Model
  9. Analysis
  10. Conclusion
  11. Future Work
  12. References

📰 Project Overview

FakeNews-Detector is a Jupyter notebook-based machine learning project that tackles the pervasive problem of misinformation in digital media. By applying classical natural language processing (NLP) techniques and several classification algorithms, the project aims to automatically distinguish genuine news articles from fabricated ones with high accuracy.

🎯 Objective

The primary objectives of this project are:

  1. To develop a robust model capable of accurately classifying news articles as either real or fake.
  2. To compare the effectiveness of various machine learning algorithms in the context of fake news detection.
  3. To evaluate the impact of text preprocessing on model performance.
  4. To contribute to the ongoing efforts in combating misinformation and promoting media literacy.

📊 Dataset

The project utilizes the "Fake and Real News Dataset" from Kaggle:

  • Source: Fake and Real News Dataset (Kaggle)
  • Structure: two columns, 'Text' (news content) and 'label' (0 = Fake, 1 = Real)
  • Task Type: Binary Classification

🛠 Setup and Environment

To run the FakeNews-Detector Jupyter notebook:

  1. Ensure you have Jupyter Notebook or JupyterLab installed.
  2. Install required libraries:
    pip install pandas scikit-learn spacy numpy
    
  3. Download the spaCy English language model:
    python -m spacy download en_core_web_sm
    
  4. Download the dataset from Kaggle and place it in the same directory as the notebook.
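
With the dataset in place, it can be loaded and inspected with pandas. A minimal sketch; the filename below is an assumption, so adjust it to match the downloaded file:

import pandas as pd

# Hypothetical filename; use the actual name of the CSV downloaded from Kaggle
df = pd.read_csv("fake_and_real_news.csv")

print(df.shape)                     # number of articles and columns
print(df["label"].value_counts())   # class balance: 0 = Fake, 1 = Real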

📓 Jupyter Notebook Structure

The project is contained in a single Jupyter notebook, structured as follows:

  1. Introduction and Setup

    • Project overview
    • Library imports
    • Data loading
  2. Data Preprocessing

    • Text cleaning function
    • Application of preprocessing
  3. Feature Extraction

    • Bag-of-N-grams implementation
  4. Model Training and Evaluation

    • KNN (Euclidean and Cosine)
    • Random Forest
    • Multinomial Naive Bayes
  5. Results and Analysis

    • Performance comparison
    • Best model identification
  6. Conclusion and Future Work

🛠 Methodology

Data Preprocessing

Two approaches were implemented:

  1. Without Preprocessing: Raw text data used directly.
  2. With Preprocessing:
    • Removal of stop words
    • Elimination of punctuation
    • Lemmatization of words

Preprocessing function:

import spacy

# Load the small English pipeline (tokenization, stop-word flags, lemmas)
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Remove stop words and punctuation, then lemmatize the remaining tokens."""
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        # Drop stop words and punctuation
        if token.is_stop or token.is_punct:
            continue
        # Keep the lemma (base form) of each remaining token
        filtered_tokens.append(token.lemma_)
    return " ".join(filtered_tokens)

Feature Extraction

The bag-of-n-grams approach was employed using scikit-learn's CountVectorizer, which represents each article as a vector of counts over contiguous token sequences (n-grams).
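
To make this concrete, here is a small self-contained example of bag-of-n-grams features (the toy corpus is invented):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the senate passed the bill",
    "the bill was never passed",
]

# ngram_range=(1, 2) counts unigrams and bigrams; the best model below uses (1, 3)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned n-gram vocabulary
print(X.toarray())                         # per-document n-gram counts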

Model Selection

Several classification algorithms were implemented and compared, as sketched below:

  1. K-Nearest Neighbors (KNN)
    • Euclidean distance metric
    • Cosine similarity metric
  2. Random Forest
  3. Multinomial Naive Bayes
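
A sketch of how these classifiers can be configured in scikit-learn; the hyperparameters shown are assumptions, not the notebook's exact settings:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

models = {
    # KNN with the default Euclidean (Minkowski, p=2) metric
    "knn_euclidean": KNeighborsClassifier(n_neighbors=10),
    # KNN with cosine distance, often better suited to sparse text vectors
    "knn_cosine": KNeighborsClassifier(n_neighbors=10, metric="cosine"),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": MultinomialNB(),
}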

📈 Experiments and Results

Models without Preprocessing

1. KNN (Euclidean Distance)

              precision    recall  f1-score   support

           0       0.96      0.49      0.65      1000
           1       0.65      0.98      0.78       980

    accuracy                           0.73      1980
   macro avg       0.81      0.74      0.72      1980
weighted avg       0.81      0.73      0.72      1980

2. KNN (Cosine Similarity)

              precision    recall  f1-score   support

           0       0.99      0.55      0.71      1000
           1       0.69      1.00      0.81       980

    accuracy                           0.77      1980
   macro avg       0.84      0.77      0.76      1980
weighted avg       0.84      0.77      0.76      1980

3. Random Forest

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1000
           1       0.99      1.00      0.99       980

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980

4. Multinomial Naive Bayes

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1000
           1       0.99      0.98      0.99       980

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980

Models with Preprocessing

1. Random Forest (Trigrams)

              precision    recall  f1-score   support

           0       0.93      0.98      0.96      1000
           1       0.98      0.93      0.95       980

    accuracy                           0.96      1980
   macro avg       0.96      0.95      0.95      1980
weighted avg       0.96      0.96      0.96      1980

2. Random Forest (1-3 grams)

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1000
           1       1.00      0.99      1.00       980

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980

🏆 Best Model

The best-performing model in our experiments was the Random Forest classifier with 1-3 grams and preprocessing. This model achieved perfect or near-perfect scores across all metrics:

  • Accuracy: 1.00 (100%)
  • Precision: 0.99 (Fake), 1.00 (Real)
  • Recall: 1.00 (Fake), 0.99 (Real)
  • F1-score: 1.00 (Fake), 1.00 (Real)

Implementation details:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# X_train / X_test are assumed to hold the preprocessed article text
clf = Pipeline([
    # Bag-of-n-grams features covering unigrams, bigrams, and trigrams
    ('vectorizer_n_grams', CountVectorizer(ngram_range=(1, 3))),
    ('random_forest', RandomForestClassifier())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
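
The X_train, X_test, y_train, and y_test variables above come from a held-out split whose exact parameters aren't recorded in this README, so the values below are assumptions. The per-class reports in the Experiments section can then be reproduced with scikit-learn's classification_report:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed split parameters; the reports above imply a test set of 1,980 articles
X_train, X_test, y_train, y_test = train_test_split(
    df["processed_text"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"]
)

# After fitting the pipeline and predicting as above:
print(classification_report(y_test, y_pred))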

This model's exceptional performance is likely attributable to:

  1. The use of a wide range of n-grams (1-3), capturing both individual words and short phrases
  2. The Random Forest algorithm's ability to handle high-dimensional data and capture complex relationships
  3. Effective preprocessing, which removed noise and standardized the text data

🔍 Analysis

  1. KNN Performance:

    • Cosine similarity metric outperformed Euclidean distance
    • Overall performance was lower compared to other algorithms, possibly due to the high-dimensional nature of text data
  2. Random Forest:

    • Demonstrated excellent performance across all configurations
    • Showed robustness to different preprocessing approaches
    • The combination of multiple decision trees likely contributed to its ability to capture complex patterns in the text data
  3. Multinomial Naive Bayes:

    • Performed very well without preprocessing
    • Its strong performance aligns with its reputation as an effective algorithm for text classification tasks
  4. Effect of Preprocessing:

    • Generally maintained or slightly improved model performance
    • The performance drop for the trigram-only Random Forest suggests that preprocessing may remove informative features, though the simultaneous change in n-gram range confounds that comparison
  5. N-gram Impact:

    • The 1-3 gram configuration outperformed the trigram-only configuration (1.00 vs. 0.96 accuracy for Random Forest with preprocessing), underscoring the value of capturing individual words alongside short phrases

🚀 Conclusion

The FakeNews-Detector project successfully demonstrates the effectiveness of machine learning techniques in distinguishing between real and fake news articles. Key findings include:

  1. Random Forest emerged as the most effective algorithm for this task, achieving near-perfect accuracy.
  2. The combination of preprocessing and a wide range of n-grams (1-3) yielded the best results.
  3. While preprocessing generally improved performance, its impact varied across different models and configurations.
  4. The high accuracy achieved by multiple models suggests that lexical features (captured by bag-of-n-grams) are strongly indicative of fake news in this dataset.

🔮 Future Work

  1. Experiment with more advanced NLP techniques:
    • Word embeddings (e.g., Word2Vec, GloVe)
    • Transformer-based models (e.g., BERT, RoBERTa)
  2. Incorporate additional features:
    • Source credibility scores
    • Publication date and time
    • Author information
  3. Develop a web-based interface for real-time fake news detection
  4. Explore ensemble methods to combine the strengths of different models
  5. Investigate the model's performance on different types of fake news (e.g., satire, propaganda)
  6. Conduct error analysis to understand the types of articles that are misclassified

📚 References

  1. Kaggle Dataset: Fake and Real News Dataset
  2. scikit-learn Documentation: https://scikit-learn.org/
  3. spaCy Documentation: https://spacy.io/
  4. Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22-36.
  5. Allcott, H., & Gentzkow, M. (2017). Social Media and Fake News in the 2016 Election. Journal of Economic Perspectives, 31(2), 211-236.