Fake News Detector Powered By Machine Learning

A complete example of building an end-to-end machine learning project from initial idea to deployment.

This repo accompanies the blog post series describing how to build a fake news detection application. The posts included here:

Initial Setup and Tooling: Describes project ideation, setting up your repository, and initial project tooling.
Exploratory Data Analysis: Describes how to acquire a dataset and perform exploratory data analysis with tools like Pandas in order to better understand the problem.
Building a V1 Model Training/Testing Pipeline: Describes how to get a functional training/evaluation pipeline for the first ML model (a random-forest classifier), including how to properly test various parts of your pipeline.
Error Analysis and Model V2: Describes how to interpret what your first model has learned through feature analysis (via techniques like Shapley values) and error analysis. Also works toward a second model powered by Roberta.
Model Deployment and Continuous Integration: Describes how to deploy your model using FastAPI and Docker and build an accompanying Chrome extension. Also illustrates key components of a continuous integration system for collaborating on the application with other team members in a scalable and reproducible fashion.

Features

Random forest classifier powered by Scikit-learn.
RoBERTa model powered by HuggingFace Transformers and PyTorch Lightning.
Data versioning and configurable train/test pipelines using DVC.
Exploratory data analysis using Pandas.
Experiment tracking and logging via MLFlow.
Continuous integration with Github actions.
Functionality tests powered by PyTest and Great Expectations.
Error and model feature analysis via SHAP.
Production-ready server via FastAPI and Gunicorn.
Chrome extension for interacting with a model in the browser.

How to Use It

Go to the root directory of the repo and run:

pip install -r requirements.txt

Download the data from this link into data/raw.

You're ready to go!

Train

To train the random forest baseline, run the following from the root directory:

dvc repro train-random-forest

Your output should look something like the following:

INFO - 2021-01-21 21:26:49,779 - features.py - Creating featurizer from scratch...
INFO - 2021-01-21 21:26:49,781 - tree_based.py - Initializing model from scratch...
INFO - 2021-01-21 21:26:49,781 - train.py - Training model...
INFO - 2021-01-21 21:26:50,163 - features.py - Saving featurizer to disk...
INFO - 2021-01-21 21:26:50,169 - tree_based.py - Featurizing data from scratch...
INFO - 2021-01-21 21:26:59,360 - tree_based.py - Saving model to disk...
INFO - 2021-01-21 21:26:59,459 - train.py - Evaluating model...
INFO - 2021-01-21 21:26:59,584 - train.py - Val metrics: {'val f1': 0.7587628865979381, 'val accuracy': 0.7266355140186916, 'val auc': 0.8156070164865074, 'val true negative': 381, 'val false negative': 116, 'val false positive': 235, 'val true positive': 552}

Deploy

Once you have successfully trained a model using the step above, you should have a model checkpoint saved in model_checkpoints/random_forest.

Now build your deployment Docker image:

docker build . -f deploy/Dockerfile.serve -t fake-news-deploy

Once your image is built, you can run the model locally via a REST API with:

docker run -p 8000:80 -e MODEL_DIR="/home/fake-news/random_forest" -e MODULE_NAME="fake_news.server.main" fake-news-deploy

From here you can interact with the API using Postman or through a simple cURL request:

curl -X POST http://127.0.0.1:8000/api/predict-fakeness -d '{"text": "some example string"}'

parulnith/fake-news

Fake News Detector Powered By Machine Learning

Features

How to Use It

Train

Deploy