Machine Learning on census data with FastAPI

This project demonstrates how to use FastAPI to serve a machine learning model through a REST API. It uses DVC to version the data and to define a reproducible ML pipeline.

Tech Stack

Python, DVC, scikit-learn, pandas, FastAPI and Conda

DVC DAG

The DVC pipeline can be viewed in the terminal with the command:

dvc dag

Output

+-----------------------------+  
| starter/data/census.csv.dvc |  
+-----------------------------+  
                *                
                *                
                *                
        +------------+           
        | clean_data |           
        +------------+           
                *                
                *                
                *                
        +-------------+          
        | train_model |          
        +-------------+          
                *                
                *                
                *                
          +----------+           
          | evaluate |           
          +----------+
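The graph above is a linear chain, so each stage runs only after the one above it. As an illustration only (this is not how DVC is implemented), the same dependency structure can be modeled with the standard library's graphlib module, using the stage names from the output above.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on,
# copied from the `dvc dag` output above.
PIPELINE = {
    "starter/data/census.csv.dvc": set(),
    "clean_data": {"starter/data/census.csv.dvc"},
    "train_model": {"clean_data"},
    "evaluate": {"train_model"},
}


def execution_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for stage in execution_order(PIPELINE):
        print(stage)
```

Because every stage has exactly one predecessor, only one valid order exists: the data file, then clean_data, train_model and evaluate.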

Environment Setup

The Makefile contains the commands to set up the environment for the project. They create a conda environment and install the dependencies. If you prefer pip, you can install the dependencies from the requirements.txt file instead.

Notes from the project

The commands below are more or less the same as the ones used to create the project. They are not needed to clone and run the project.

Initialize DVC inside the Git repository.

dvc init

Start to track the UCI census data file.

dvc add starter/data/census.csv

Add an AWS S3 bucket as the default DVC remote, where the file will be stored.

dvc remote add -d storage s3://<name-of-s3-bucket>

Tell dvc to use the AWS profile named udacity, instead of the default profile.

dvc remote modify storage profile udacity

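Define the clean_data stage, which runs the clean_data.py script. The --no-exec flag records the stage in dvc.yaml without executing it. Newer DVC releases replace dvc run with dvc stage add.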

dvc run -n clean_data -d starter/data/census.csv -d starter/starter/clean_data.py -o starter/data/census_clean.csv --no-exec python starter/starter/clean_data.py
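The clean_data.py script itself is not shown here. As a hedged sketch of what such a cleaning step could do, assuming the raw CSV has stray spaces after its commas, the following strips whitespace from every cell. The function names are hypothetical, and it uses only the standard library rather than pandas so that it runs without extra dependencies.

```python
import csv
import io


def clean_rows(raw_text):
    """Strip surrounding whitespace from every header and value."""
    reader = csv.reader(io.StringIO(raw_text))
    # csv.reader yields [] for blank lines; `if row` skips them.
    return [[cell.strip() for cell in row] for row in reader if row]


def to_csv(rows):
    """Serialize cleaned rows back to CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    raw = "age, workclass, salary\n39, State-gov, <=50K\n"
    print(to_csv(clean_rows(raw)), end="")
```

In the pipeline, a script like this would read starter/data/census.csv and write starter/data/census_clean.csv, matching the -d and -o options of the stage.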

The DVC pipeline can be run with the dvc repro command.

Deploy the project

Create Heroku application

heroku create marcus-census-fastapi --buildpack heroku/python

Set the heroku Git remote to https://git.heroku.com/marcus-census-fastapi.git

heroku git:remote --app marcus-census-fastapi

Add an extra buildpack layer for DVC (see also the Aptfile).

heroku buildpacks:add --index 1 heroku-community/apt

Run git push heroku main to create a new release using these buildpacks.

git push heroku main

Add the AWS access keys as Heroku config vars.

heroku config:set AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=yyy
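With the keys set, the app can fetch the DVC-tracked files from S3 when it starts on Heroku. The sketch below is one possible way to do that and is not taken from the project: it assumes the app runs on a Heroku dyno (Heroku sets the DYNO environment variable there) and that the dvc executable is on the PATH. The helper names are hypothetical.

```python
import os
import subprocess


def dvc_pull_commands(environ, dvc_dir_exists):
    """Return the DVC commands to run at startup, or [] outside Heroku."""
    if "DYNO" not in environ or not dvc_dir_exists:
        return []
    return [
        # The deployed app has no .git directory, so tell DVC not to expect one.
        ["dvc", "config", "core.no_scm", "true"],
        ["dvc", "pull"],
    ]


def pull_data():
    """Run the startup commands, raising if any of them fails."""
    for cmd in dvc_pull_commands(os.environ, os.path.isdir(".dvc")):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    pull_data()
```

Calling pull_data() before the model is loaded means a failed download stops the app at startup instead of surfacing later as a missing file.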