
Final project for the fourth course of Udacity's MLOps Nanodegree, "4. Deploying a Scalable ML Pipeline in Production"

Deploying a Machine Learning Model on Heroku with FastAPI


This repository contains my submission for Deploying a Scalable ML Pipeline in Production, the fourth course of Udacity's Machine Learning DevOps Engineer Nanodegree. We train a classification model on the Census Income Data Set and deploy it on Heroku using FastAPI.

How this repo is organized

Below is the directory structure, with annotations for the relevant directories and files.

|-- Aptfile 
|-- Makefile
|-- Procfile
|-- README.md
|-- README_starter.md
|-- data  # Data directory, almost everything tracked by DVC and stored on S3
|   |-- processed  # Processed data
|   |-- raw  # raw data, as downloaded from starter repository
|   |   `-- census.csv.dvc
|   `-- train_test_data  # training and testing data, after preprocessing.
|-- data_test  # data used for tests, not tracked by DVC
|   `-- processed
|       `-- census-fix-spaces-nodups.csv
|-- dvc.lock
|-- dvc.yaml
|-- dvc_on_heroku_instructions.md
|-- main.py # main app functions
|-- model # model artifacts and metrics
|   |-- slice_output.txt  # metrics on slices of categorical variables
|   `-- summary.json
|-- model_card.md
|-- EDA notebooks
|   `-- eda-1.ipynb
|-- reports # EDA reports using pandas profiling
|   |-- pandas-profiling-clean.html  # EDA on clean data
|   `-- pandas-profiling-raw.html  # EDA on raw data
|-- requirements.txt  # Python requirements
|-- run_live_post_get_live_api.py
|-- runtime.txt  # select Python version on Heroku
|-- sanitycheck.py
|   |-- continuous_deloyment.png # proof that continuous deployment is enabled on Heroku
|   |-- continuous_integration.png # proof that CI using github actions is set and passing
|   |-- dvcdag.png # DVC dag in ascii format
|   |-- dvcdag_dot_graphviz.png # DVC dag rendering dot format with graphviz
|   |-- dvcdag_outs.png # DVC dag (considering outputs) in ascii format
|   |-- dvcdag_outs_dot_graphviz.png # DVC dag (with outputs) rendering dot format with graphviz
|   |-- example.png # example of input body on FastAPI
|   |-- heroku_deploy_builds.png # image showing many builds on Heroku
|   `-- live_post_get.png # script, status codes and contents for both GET and POST METHODS
|-- setup.py # setup for package installation
|-- starter # our package
|   |-- __init__.py # random seed is set here
|   |-- ml
|   |   |-- __init__.py
|   |   |-- data.py # data preprocessing
|   |   |-- model.py # model training, inference, and metrics calculation
|   |   |-- test_data.py # tests for data.py
|   |   `-- test_model.py # tests for model.py
|   `-- train_model.py # script for model training
|-- test_main.py # tests for main.py
|-- tmp_dvc_commands_run.sh # Draft commands used for running the DVC pipeline (subsequent modifications were made directly in the dvc.yaml file)
`-- tox.ini # configurations for flake8 and pytest 

DVC Pipeline

DVC was used for the entire data and modeling pipeline. Below are the contents of the dvc.yaml file:

    cmd: sed 's/, /,/g' data/raw/census.csv > data/processed/census-fix-spaces.csv
    - data/raw/census.csv
    - data/processed/census-fix-spaces.csv
    cmd: awk '{counts[$0]++;if (counts[$0] == 1) {print $0}}' data/processed/census-fix-spaces.csv
      > data/processed/census-fix-spaces-nodups.csv
    - data/processed/census-fix-spaces.csv
    - data/processed/census-fix-spaces-nodups.csv
    cmd: python3 starter/train_model.py
    - data/processed/census-fix-spaces-nodups.csv
    - starter/train_model.py
    - data/train_test_data/X_test.joblib
    - data/train_test_data/X_train.joblib
    - model/encoder.joblib
    - model/lb.joblib
    - model/model.joblib
    - model/slice_output.txt

    - model/summary.json:
        cache: false
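
For readers less familiar with sed and awk, the two cleaning commands in the pipeline above can be mirrored in Python. This is only a minimal sketch, not code from this repository: the function names `fix_spaces`, `drop_duplicate_lines`, and `clean_census` are hypothetical, the sample rows are made up for illustration, and it assumes the whole file fits in memory.

```python
from typing import Iterable, List


def fix_spaces(lines: Iterable[str]) -> List[str]:
    """Mirror `sed 's/, /,/g'`: replace each ", " with ","."""
    return [line.replace(", ", ",") for line in lines]


def drop_duplicate_lines(lines: Iterable[str]) -> List[str]:
    """Mirror the awk command: keep only the first occurrence of each line."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def clean_census(raw_text: str) -> str:
    """Apply both steps in pipeline order: fix spaces, then drop duplicates."""
    cleaned = drop_duplicate_lines(fix_spaces(raw_text.splitlines()))
    if not cleaned:
        return ""
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    # Illustrative rows only; not taken from data/raw/census.csv.
    sample = "age, workclass, salary\n39, State-gov, <=50K\n39, State-gov, <=50K\n50, Private, >50K\n"
    print(clean_census(sample), end="")
```

As in the awk command, the header row is treated like any other line, and the order of the surviving rows is preserved.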

P.S. The state of this repo at the time of submission can be accessed at 0.0.1.