To understand how a model behaves and to find its underlying problems, we need to take care of the following:
- Periodic training
- An optimal retraining strategy
- Model performance monitoring
- Analysis
- Clear visibility into the model, which guides improvements to its performance
In this section we will diagnose and fix problems in code deployed to production. To that end, we will:
- Use the Evidently and MLflow libraries.
- Deploy a machine learning model to Heroku.
- Calculate data drift for the model.
- Use MLflow Tracking to record the training experiments that indicate data drift.
- Explore the results in the MLflow UI.
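To make "data drift" concrete, here is a minimal sketch of one common way to score it for a single numerical feature: the two-sample Kolmogorov-Smirnov statistic, written with the standard library only. This is a generic stand-in and not Evidently's API; the function name `ks_statistic` and the sample values are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = no overlap)."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Made-up normalized temperatures, not values taken from the dataset.
winter_temps = [0.20, 0.22, 0.25, 0.27, 0.30, 0.31]
summer_temps = [0.65, 0.68, 0.70, 0.72, 0.75, 0.78]

same_score = ks_statistic(winter_temps, winter_temps)
shifted_score = ks_statistic(winter_temps, summer_temps)
print(same_score, shifted_score)  # 0.0 1.0
```

A score near 0 suggests the feature looks the same in both periods, while a score near 1 suggests the distribution has moved. Where to draw the line between the two is a judgment call that depends on sample size and on the feature.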
In this tutorial we will use the Bike Sharing Dataset.
You can use the following code snippet (from train.py) to explore the data before starting the tutorial.
import io
import zipfile
import pandas as pd
import requests

# download the archive and read the daily data straight from memory
content = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip").content
with zipfile.ZipFile(io.BytesIO(content)) as arc:
    raw_data = pd.read_csv(arc.open("day.csv"), header=0, sep=',', parse_dates=['dteday'], index_col='dteday')
# observe data structure
raw_data.tail()
Before you start, you will need:
- A Heroku account
- A GitHub account
- A clone of the GitHub repo: https://github.com/udacity/cd0583-diagnose-and-fix.git
- The dependencies listed in the requirements.txt file. You can set up a virtual environment using Anaconda and install the required dependencies there. The runtime.txt file contains the Python version that is used for this tutorial.
Follow the steps below to deploy this model:
- Ensure that all the dependencies listed in the requirements.txt file are installed.
- Run the train.py file to log experiments in MLflow.
- View the results in the MLflow web UI.
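The idea behind logging drift per experiment can be sketched as a loop that scores each batch of new data against a reference period and reports one metric per batch. The code below is not the repo's train.py. The names `mean_shift` and `track_drift`, the simple mean-based score, and the rental counts are all assumptions made for illustration. The reporting hook defaults to `print` so the sketch runs without a tracking server. Inside an `mlflow.start_run()` block, `mlflow.log_metric` could be passed instead, since it accepts a key, a value, and a step.

```python
from statistics import mean


def mean_shift(reference, current):
    """Hypothetical stand-in drift score: relative change in the mean."""
    ref_mean = mean(reference)
    return abs(mean(current) - ref_mean) / abs(ref_mean)


def track_drift(reference, batches, log_metric=print):
    """Score every batch against the reference and report each result.

    log_metric is called as log_metric(name, value, step).
    """
    scores = {}
    for step, (label, values) in enumerate(batches.items()):
        scores[label] = mean_shift(reference, values)
        log_metric("drift_score", scores[label], step)
    return scores


# Made-up daily rental counts, not values taken from the dataset.
reference = [100, 120, 110, 130]
batches = {
    "week_1": [105, 125, 115, 115],  # same mean as the reference
    "week_2": [200, 240, 230, 250],  # mean has doubled
}

scores = track_drift(reference, batches)
```

Keeping the logger as a parameter is a design choice for this sketch only: it separates the scoring loop from the tracking backend, which makes the loop easy to try out locally before pointing it at MLflow.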
Once you have diagnosed the problem, consider the following fixes:
- If there is substantial data drift, reweight the samples in the training data, giving more importance to newer patterns.
- Identify new segments where the model fails and create a different model for them. Consider using an ensemble of several models for different segments of the data.
- Change the prediction target. For example, switch from weekly to daily forecast or replace the regression model with classification into categories from "high" to "low."
- Pick a different model architecture to account for ongoing drift. You can consider incremental or online learning, where the model continuously adapts to new data.
- Apply domain adaptation strategies. There are a number of approaches to help the model better generalize to a new target domain.
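As an illustration of the first option, sample reweighting, the sketch below builds exponentially decaying weights so that newer rows count more during training. The function name `recency_weights` and the `half_life` parameter are hypothetical, and exponential decay is only one of several reasonable weighting schemes.

```python
def recency_weights(n_samples, half_life):
    """Weights for samples ordered oldest to newest; the newest gets 1.0.

    A sample that is `half_life` positions older than the newest one
    receives half of its weight.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return [0.5 ** ((n_samples - 1 - i) / half_life) for i in range(n_samples)]


weights = recency_weights(5, half_life=2)
print([round(w, 3) for w in weights])  # [0.25, 0.354, 0.5, 0.707, 1.0]
```

Many scikit-learn estimators accept a list like this through the `sample_weight` argument of `fit`. A shorter half-life makes the model follow recent patterns more closely, at the cost of discarding more of the older history.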