This is our group's submission for Project 4, where we were tasked with solving, analyzing, or visualizing a problem using machine learning (ML) alongside the other technologies we've learned. We chose to tackle the problem of identifying fake news using ML.
Fake news is a growing problem in today's society, and it can have serious consequences, from influencing public opinion to shaping policy decisions. In this project, we aim to build a machine learning model that can accurately identify whether a given news article is fake or real. We utilized the following technologies in our project:
- Scikit-learn for machine learning
- Python Pandas for data manipulation
- Python Matplotlib for data visualization
- HTML/CSS/Bootstrap for frontend web development
- JavaScript Plotly for data visualization
- SQL Database for data storage and retrieval
Our project follows the technical requirements outlined in the assignment brief, including:
- Implementing a Python script to initialize, train, and evaluate our model
- Cleaning, normalizing, and standardizing our data prior to modeling
- Utilizing data retrieved from SQL or Spark (a minimal sketch of this retrieval follows this list)
- Demonstrating meaningful predictive power with at least 75% classification accuracy or 0.80 R-squared
- Documenting our model optimization and evaluation process, recording the iterative changes made to the model and the resulting changes in model performance in either a CSV/Excel table or in the Python script itself
- Printing or displaying overall model performance at the end of the script
- Maintaining a GitHub repository that is free of unnecessary files and folders and has an appropriate .gitignore in use
- Customizing the README as a polished presentation of the content of the project
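For the SQL requirement, the articles are loaded from the database built in the ETL step. A minimal sketch of how that retrieval might look, assuming a SQLite database and an `articles` table (the actual schema lives in `ETL/Schema`):

```python
import pandas as pd
from sqlalchemy import create_engine

# Database URL and table/column names are assumptions for illustration;
# the real schema is defined in ETL/Schema/Project_4 Database Schema.sql.
engine = create_engine("sqlite:///fake_news.db")
df = pd.read_sql("SELECT text, label FROM articles", engine)
```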
Our team:
- Johan Snyman
- Jon Wood
The dataset used in this project consists of more than 20,000 news articles labeled as either fake or real; we used a sample of 5,000 of these articles for this project. The dataset is split into training and test sets to train and evaluate the machine learning models.
Source: https://www.kaggle.com/c/fake-news/data
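Loading and splitting the data might look like the following sketch; the `train.csv` filename comes from the Dataset folder, while the column names (`text`, `label`), the 80/20 split, and the random seed are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the labeled articles; "text" and "label" column names are assumed.
df = pd.read_csv("Dataset/train.csv")

# Work with a 5,000-article sample, as described above.
sample = df.dropna(subset=["text"]).sample(n=5000, random_state=42)

# Hold out a test set for evaluation (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    sample["text"], sample["label"], test_size=0.2, random_state=42
)
```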
The preprocessing steps include:
- Lowercasing the text
- Tokenization
- Removing stopwords
- Stemming using the PorterStemmer algorithm
- Vectorization
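As a rough sketch, these steps can be chained into a single function using NLTK (the notebook may implement them differently; the regex tokenizer here stands in for whichever tokenizer was actually used):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stopword list

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, tokenize into alphabetic words, drop stopwords, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

print(preprocess("Breaking: Scientists discover shocking new facts!"))
# -> "break scientist discov shock new fact"
```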
The text data is transformed into numerical features using the CountVectorizer class from scikit-learn, which lets the machine learning algorithms process the text.
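A minimal sketch of the vectorization step (`max_features` and the use of sklearn's built-in English stopword list are assumptions; note the vocabulary is fit on the training data only and reused for the test set):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training articles only, then reuse it on the
# test articles so both sets share the same feature space.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
```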
Two classification models are used to identify fake news articles:
- Logistic Regression
- Passive Aggressive Classifier
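Both classifiers come from scikit-learn and train on the same vectorized features; a minimal sketch (the `max_iter` values are assumptions):

```python
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier

logreg = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)
pac = PassiveAggressiveClassifier(max_iter=1000).fit(X_train_vec, y_train)

# Quick sanity check against the held-out test set.
print("Logistic Regression accuracy:", logreg.score(X_test_vec, y_test))
print("Passive Aggressive accuracy: ", pac.score(X_test_vec, y_test))
```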
Hyperparameter tuning is performed using GridSearchCV to find the best-performing settings for each model.
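For example, a grid search over Logistic Regression's regularization strength might look like this (the parameter grid and cross-validation settings are assumptions; the actual grids are documented in the notebook):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Search over the regularization strength C with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_train_vec, y_train)
print(grid.best_params_, grid.best_score_)
```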
The performance of each model is evaluated using the following metrics:
- Prediction accuracy
- Classification report
- Confusion matrix
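All three metrics are available from `sklearn.metrics`; a sketch of the evaluation for one model:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = logreg.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))  # overall hit rate
print(classification_report(y_test, y_pred))        # precision/recall/F1 per class
print(confusion_matrix(y_test, y_pred))             # fake vs. real breakdown
```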
We will be giving a group presentation on our project in which all members speak. We will present the project's content and conclusions with smooth transitions, stay within the allotted time, and keep the audience engaged throughout.
```
Dataset
|-- source-info.txt
|-- test.csv
|-- train.csv
ETL
|-- ETL_Project_4.ipynb
|-- Schema
|   |-- Project 4 - DBD.png
|   |-- Project_4 Database Schema.sql
|   |-- QuickDBD-Project 4.pdf
|-- cleanup.txt
ML
|-- FAKE-NEWS-ML.ipynb
|-- api_keys.py
|-- logreg_model_optimization.png
|-- model_optimization.png
Pickles
|-- logisticreg_model.pkl
|-- passive_aggressive_model.pkl
|-- passiveagressive_model.pkl
|-- tfidf_vectorizer.pkl
|-- tfidfvect2.pkl
Presentation
|-- Project-4-Presentation.pdf
Proposal
|-- Project 4 - A-TEAM - Proposal.pdf
Static
|-- Images
|   |-- fake-new-facts.jpeg
|   |-- Project 4 outline.png
|   |-- fake.png
|   |-- johan.jpg
|   |-- jono.jpg
|   |-- logreg-results.jpg
|   |-- passive-aggressive-results.jpg
|   |-- true.png
|-- css
|   |-- project4_style.css
templates
|-- index.html
README.md
app.py
requirements.txt
```
To run the web application, follow these steps:
- Clone the repository
- Install the required Python libraries with `pip install -r requirements.txt`
- Run `python app.py` in the terminal
- Open the web application in the browser at `http://localhost:5000`
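`app.py` itself is not reproduced here, but as a rough sketch of how the pickled vectorizer and model can back a Flask prediction route (the route, form field, and template variable are assumptions; the pickle filenames come from the Pickles folder):

```python
import pickle
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the persisted vectorizer and model once at startup.
with open("Pickles/tfidf_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("Pickles/logisticreg_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        # The "article" form field name is an assumption about index.html.
        text = request.form.get("article", "")
        prediction = model.predict(vectorizer.transform([text]))[0]
    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True, port=5000)
```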
- Improve the accuracy of the model by using more advanced techniques, such as deep learning
- Increase the size of the dataset to improve the generalization of the model
- Deploy the web application on a cloud platform, such as Google Cloud or AWS
We would like to thank our bootcamp instructors for their guidance and support throughout this project. We would also like to thank the creators of the dataset used in this project.
- Scikit-learn documentation: https://scikit-learn.org/stable/documentation.html
- Bootstrap documentation: https://getbootstrap.com/docs/5.1/getting-started/introduction/
- Plotly documentation: https://plotly.com/javascript/