/fake-news

Building a fake news detector from initial ideation to model deployment

Primary LanguageJupyter NotebookGNU Affero General Public License v3.0AGPL-3.0

Fake News Detector Powered By Machine Learning

A complete example of building an end-to-end machine learning project from initial idea to deployment.

This repo accompanies the blog post series describing how to build a fake news detection application. The posts included here:

  • Initial Setup and Tooling: Describes project ideation, setting up your repository, and initial project tooling.

  • Exploratory Data Analysis: Describes how to acquire a dataset and perform exploratory data analysis with tools like Pandas in order to better understand the problem.

  • Building a V1 Model Training/Testing Pipeline: Describes how to get a functional training/evaluation pipeline for the first ML model (a random-forest classifier), including how to properly test various parts of your pipeline.

  • Error Analysis and Model V2: Describes how to interpret what your first model has learned through feature analysis (via techniques like Shapley values) and error analysis. Also works toward a second model powered by Roberta.

  • Model Deployment and Continuous Integration: Describes how to deploy your model using FastAPI and Docker and build an accompanying Chrome extension. Also illustrates key components of a continuous integration system for collaborating on the application with other team members in a scalable and reproducible fashion.

Features

How to Use It

Go to the root directory of the repo and run:

pip install -r requirements.txt

Download the data from this link into data/raw.

You're ready to go!

Train

To train the random forest baseline, run the following from the root directory:

dvc repro train-random-forest

Your output should look something like the following:

INFO - 2021-01-21 21:26:49,779 - features.py - Creating featurizer from scratch...
INFO - 2021-01-21 21:26:49,781 - tree_based.py - Initializing model from scratch...
INFO - 2021-01-21 21:26:49,781 - train.py - Training model...
INFO - 2021-01-21 21:26:50,163 - features.py - Saving featurizer to disk...
INFO - 2021-01-21 21:26:50,169 - tree_based.py - Featurizing data from scratch...
INFO - 2021-01-21 21:26:59,360 - tree_based.py - Saving model to disk...
INFO - 2021-01-21 21:26:59,459 - train.py - Evaluating model...
INFO - 2021-01-21 21:26:59,584 - train.py - Val metrics: {'val f1': 0.7587628865979381, 'val accuracy': 0.7266355140186916, 'val auc': 0.8156070164865074, 'val true negative': 381, 'val false negative': 116, 'val false positive': 235, 'val true positive': 552}

Deploy

Once you have successfully trained a model using the step above, you should have a model checkpoint saved in model_checkpoints/random_forest.

Now build your deployment Docker image:

docker build . -f deploy/Dockerfile.serve -t fake-news-deploy

Once your image is built, you can run the model locally via a REST API with:

docker run -p 8000:80 -e MODEL_DIR="/home/fake-news/random_forest" -e MODULE_NAME="fake_news.server.main" fake-news-deploy

From here you can interact with the API using Postman or through a simple cURL request:

curl -X POST http://127.0.0.1:8000/api/predict-fakeness -d '{"text": "some example string"}'