This repo is intended to demonstrate an end-to-end MLOps workflow on Databricks, where a model is deployed along with its ancillary pipelines to a specified (currently single) Databricks workspace.
Each pipeline (e.g. model training pipeline, model deployment pipeline) is deployed as a Databricks job, and these jobs are deployed to a Databricks workspace using Databricks Labs' dbx tool.
The use case at hand is a churn prediction problem. We use the IBM Telco Customer Churn dataset to build a simple classifier to predict whether a customer will churn from a fictional telco company.
Note that the package is developed solely via an IDE, and as such there are no Databricks Notebooks in the repository. All jobs are executed via a command-line workflow using dbx.
The following pipelines are currently defined within the package:
demo-setup
- Deletes existing feature store tables, existing MLflow experiments and models registered to MLflow Model Registry, in order to start afresh for a demo.
feature-table-creation
- Creates new feature table and separate labels Delta table.
model-train
- Trains a scikit-learn Random Forest model
model-deployment
- Compare the Staging versus Production models in the MLflow Model Registry. Transition the Staging model to Production if outperforming the current Production model.
model-inference-batch
- Load a model from MLflow Model Registry, load features from Feature Store and score batch.
The following outlines the workflow to demo the repo.
- Configure Databricks CLI connection profile
  - The project is designed to use 3 different Databricks CLI connection profiles: dev, staging and prod. These profiles are set in e2e-mlops/.dbx/project.json.
  - Note that for demo purposes we use the same connection profile for each of the 3 environments. In practice each profile would correspond to separate dev, staging and prod Databricks workspaces.
  - This project.json file will have to be adjusted to match the connection profiles a user has configured on their local machine.
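As a rough illustration of that adjustment, the sketch below swaps the connection profile for selected environments in a parsed project.json. The layout it assumes (a top-level `environments` mapping whose entries carry a `profile` key), the helper name `set_profiles`, and the profile names are all assumptions made for illustration; compare them with the actual file under .dbx/ before relying on this.

```python
import copy
import json


def set_profiles(project: dict, profiles: dict) -> dict:
    """Return a copy of a parsed project.json with selected profiles replaced.

    Assumes (not confirmed by this README) a layout such as
    {"environments": {"dev": {"profile": "..."}, ...}}.
    """
    updated = copy.deepcopy(project)
    for env_name, profile in profiles.items():
        if env_name not in updated["environments"]:
            raise KeyError(f"environment {env_name!r} not found in project config")
        updated["environments"][env_name]["profile"] = profile
    return updated


# Hypothetical starting point: all three environments share one profile,
# as in the demo setup described above.
example_project = {
    "environments": {
        "dev": {"profile": "DEFAULT"},
        "staging": {"profile": "DEFAULT"},
        "prod": {"profile": "DEFAULT"},
    }
}

# Hypothetical local profile names pointing at separate workspaces.
adjusted = set_profiles(example_project, {"staging": "my-staging", "prod": "my-prod"})
print(json.dumps(adjusted, indent=2))
```

The helper works on a copy, so the original mapping is left untouched until you decide to write the result back to disk.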
- Configure Databricks secrets for GitHub Actions (ensure GitHub Actions are enabled for your forked project, as they are disabled by default in a forked repo).
  - Within the GitHub project navigate to Secrets under the project settings
  - To run the GitHub Actions workflows we require the following GitHub Actions secrets:
    - DATABRICKS_STAGING_HOST: URL of Databricks staging workspace
    - DATABRICKS_STAGING_TOKEN: Databricks access token for staging workspace
    - DATABRICKS_PROD_HOST: URL of Databricks production workspace
    - DATABRICKS_PROD_TOKEN: Databricks access token for production workspace
    - GH_TOKEN: GitHub personal access token
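A small pre-flight check can catch a missing secret before a workflow fails midway. The sketch below is only an illustration: it assumes the workflow exposes each secret to the step as an environment variable of the same name, which this README does not confirm, and the helper name `missing_secrets` is hypothetical.

```python
import os

# Secret names taken from the list above. Assumption: the workflow maps each
# secret into an environment variable with the same name.
REQUIRED_SECRETS = (
    "DATABRICKS_STAGING_HOST",
    "DATABRICKS_STAGING_TOKEN",
    "DATABRICKS_PROD_HOST",
    "DATABRICKS_PROD_TOKEN",
    "GH_TOKEN",
)


def missing_secrets(env) -> list:
    """Return the required names that are absent or empty in the given mapping."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets(os.environ)
    if missing:
        print("Missing secrets:", ", ".join(missing))
    else:
        print("All required secrets are set.")
```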
The following resources should not be present if starting from scratch:
- Feature table must be deleted
  - The table e2e_mlops_testing.churn_features will be created when the feature-table-creation pipeline is triggered.
- MLflow experiment
  - MLflow experiments are used during model training and model deployment, in both the dev and prod environments. The paths to these experiments are configured in conf/deployment.yml.
  - For demo purposes, we delete these experiments if they exist to begin from a blank slate.
- Model Registry
  - Delete the model in MLflow Model Registry if it exists.
NOTE: As part of the initial-model-train-register multitask job, the first task, demo-setup, will delete these, as specified in demo_setup.yml.
- Run the PROD-telco-churn-initial-model-train-register multitask job in the prod environment
  - To demonstrate a CI/CD workflow, we want to start from a “steady state” where there is a current model in production. As such, we will manually trigger a multitask job to do the following steps:
    - Set up the workspace for the demo by deleting existing MLflow experiments and registered models, along with existing Feature Store and labels tables.
    - Create a new Feature Store table to be used by the model training pipeline.
    - Train an initial “baseline” model
  - There is then a final manual step to promote this newly trained model to production via the MLflow Model Registry UI.
  - Outlined below are the detailed steps to do this:
    - Run the multitask PROD-telco-churn-initial-model-train-register job via an automated job cluster in the prod environment (NOTE: multitask jobs can currently only be run via dbx deploy followed by dbx launch):
      dbx deploy --jobs=PROD-telco-churn-initial-model-train-register --environment=prod --files-only
      dbx launch --job=PROD-telco-churn-initial-model-train-register --environment=prod --as-run-submit --trace
      See the Limitations section below regarding running multitask jobs. In order to reduce cluster start up time you may want to consider using a Databricks pool, and specify this pool ID in conf/deployment.yml.
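If you script this step, the deploy and launch commands can be built from the job name and environment. The sketch below only assembles the argument lists, using the flags quoted in this README; the helper name `dbx_deploy_and_launch` is hypothetical, and nothing is executed.

```python
import shlex


def dbx_deploy_and_launch(job: str, environment: str) -> list:
    """Build the `dbx deploy` and `dbx launch` argument lists for one job.

    Only flags that appear in this README are used.
    """
    deploy = [
        "dbx", "deploy",
        f"--jobs={job}",
        f"--environment={environment}",
        "--files-only",
    ]
    launch = [
        "dbx", "launch",
        f"--job={job}",
        f"--environment={environment}",
        "--as-run-submit",
        "--trace",
    ]
    return [deploy, launch]


commands = dbx_deploy_and_launch("PROD-telco-churn-initial-model-train-register", "prod")
for argv in commands:
    # Print the shell form of each command for inspection.
    print(shlex.join(argv))
```

To run them, each list could be passed in order to `subprocess.run(argv, check=True)`, assuming dbx is installed and on the PATH.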
    - PROD-telco-churn-initial-model-train-register tasks:
      - Demo setup task steps (demo-setup)
        - Delete Model Registry model if exists (archive any existing models).
        - Delete MLflow experiment if exists.
        - Delete Feature Table if exists.
      - Feature table creation task steps (feature-table-creation)
        - Creates new churn_features feature table in the Feature Store
      - Model train task steps (model-train)
        - Train initial “baseline” classifier (RandomForestClassifier - max_depth=4)
          - NOTE: no changes to config need to be made at this point
        - Register the model. Model version 1 will be registered to stage=None upon successful model training.
        - Manual Step: MLflow Model Registry UI promotion to stage='Production'
          - Go to MLflow Model Registry and manually promote model to stage='Production'.
- Code change / model update (Continuous Integration)
  - Create new “dev/new_model” branch
    git checkout -b dev/new_model
  - Make a change to the model_train.yml config file, updating max_depth under model_params from 4 to 8
    - Optional: change run name under mlflow params in model_train.yml config file
  - Create pull request, to instantiate a request to merge the branch dev/new_model into main.
  - On pull request the following steps are triggered in the GitHub Actions workflow:
    - Trigger unit tests
    - Trigger integration tests
  - Note that upon tests successfully passing, this merge request will have to be confirmed in GitHub.
- Cut release
  - Create tag (e.g. v0.0.1)
    git tag <tag_name> -a -m "Message"
    - Note that tags are matched to v*, e.g. v1.0, v20.15.10
  - Push tag
    git push origin <tag_name>
  - On pushing this the following steps are triggered in the onrelease.yml GitHub Actions workflow:
    - Trigger unit tests.
    - Deploy PROD-telco-churn-model-train job to the prod environment.
    - Deploy PROD-telco-churn-model-deployment job to the prod environment.
    - Deploy PROD-telco-churn-model-inference-batch job to the prod environment.
      - These jobs will now all be present in the specified workspace, and visible under the Workflows tab.
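To check ahead of time whether a tag name would match the v* pattern, a glob match gives a reasonable approximation. This is a sketch only: it uses Python's `fnmatch`, not GitHub's own filter-pattern matcher, and the two differ in places (for example, GitHub's `*` does not match `/`). The helper name `triggers_release` is hypothetical.

```python
from fnmatch import fnmatchcase

TAG_PATTERN = "v*"  # the pattern quoted in this README


def triggers_release(tag: str) -> bool:
    """Approximate, case-sensitive check of a tag name against the release pattern."""
    return fnmatchcase(tag, TAG_PATTERN)


for tag in ["v0.0.1", "v1.0", "v20.15.10", "release-1.0", "V1.0"]:
    print(tag, triggers_release(tag))
```

The first three tags match, while `release-1.0` and the upper-case `V1.0` do not.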
- Run the PROD-telco-churn-model-train job in the prod environment
  - Manually trigger job via UI
    - In the Databricks workspace (prod environment) go to Workflows > Jobs, where the PROD-telco-churn-model-train job will be present.
    - Click into PROD-telco-churn-model-train and select ‘Run Now’. Doing so will trigger the job on the specified cluster configuration.
  - Alternatively you can trigger the job using the Databricks CLI:
    databricks jobs run-now --job-id JOB_ID
  - Model train job steps (PROD-telco-churn-model-train)
    - Train improved “new” classifier (RandomForestClassifier - max_depth=8)
    - Register the model. Model version 2 will be registered to stage=None upon successful model training.
    - Manual Step: MLflow Model Registry UI promotion to stage='Staging'
      - Go to Model Registry and manually promote model to stage='Staging'
  - ASIDE: At this point, there should now be two model versions registered in MLflow Model Registry:
    - Version 1 (Production): RandomForestClassifier (max_depth=4)
    - Version 2 (Staging): RandomForestClassifier (max_depth=8)
- Run the PROD-telco-churn-model-deployment job (Continuous Deployment) in the prod environment
  - Manually trigger job via UI
    - In the Databricks workspace go to Workflows > Jobs, where the PROD-telco-churn-model-deployment job will be present.
    - Click into PROD-telco-churn-model-deployment and click ‘Run Now’. Doing so will trigger the job on the specified cluster configuration.
  - Alternatively you can trigger the job using the Databricks CLI:
    databricks jobs run-now --job-id JOB_ID
  - Model deployment job steps (PROD-telco-churn-model-deployment)
    - Compare new “candidate model” in stage='Staging' versus current Production model in stage='Production'.
    - Comparison criteria set through model_deployment.yml
    - Compute predictions using both models against a specified reference dataset
    - If Staging model performs better than Production model, promote Staging model to Production and archive existing Production model
    - If Staging model performs worse than Production model, archive Staging model
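The promotion rule in the deployment job comes down to a single comparison. The sketch below illustrates that rule only and is not the repository's implementation: the function name `promotion_plan`, the use of one scalar metric, and the example values are assumptions, and a tie is treated here as "not better" (the README does not say how ties are handled).

```python
def promotion_plan(staging_metric: float, production_metric: float,
                   higher_is_better: bool = True) -> dict:
    """Decide the target stage of each model from one comparison metric.

    Assumption: a tie counts as "Staging is not better", so the current
    Production model stays in place.
    """
    if higher_is_better:
        staging_wins = staging_metric > production_metric
    else:
        staging_wins = staging_metric < production_metric

    if staging_wins:
        # Promote the candidate and archive the model it replaces.
        return {"staging_model": "Production", "production_model": "Archived"}
    # Keep the current Production model and archive the candidate.
    return {"staging_model": "Archived", "production_model": "Production"}


# Hypothetical metric values computed on the reference dataset.
print(promotion_plan(staging_metric=0.81, production_metric=0.79))
print(promotion_plan(staging_metric=0.75, production_metric=0.79))
```

Set `higher_is_better=False` for error-style metrics where a lower value wins.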
- Run the PROD-telco-churn-model-inference-batch job in the prod environment
  - Manually trigger job via UI
    - In the Databricks workspace go to Workflows > Jobs, where the PROD-telco-churn-model-inference-batch job will be present.
    - Click into PROD-telco-churn-model-inference-batch and click ‘Run Now’. Doing so will trigger the job on the specified cluster configuration.
  - Alternatively you can trigger the job using the Databricks CLI:
    databricks jobs run-now --job-id JOB_ID
  - Batch model inference steps (PROD-telco-churn-model-inference-batch)
    - Load model from stage=Production in Model Registry
      - NOTE: model must have been logged to MLflow using the Feature Store API
    - Use primary keys in specified inference input data to load features from feature store
    - Apply loaded model to loaded features
    - Write predictions to specified Delta path
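The batch inference flow (look up features by primary key, apply the model, collect predictions) can be pictured with plain Python objects. This is a toy sketch, not the Feature Store API: the dictionaries stand in for the feature table and the Delta output, and the key name `customer_id`, the feature names, and `toy_model` are all hypothetical.

```python
def score_batch(inference_rows, feature_table, model, key="customer_id"):
    """Look up features by primary key, apply the model, return predictions."""
    predictions = []
    for row in inference_rows:
        features = feature_table[row[key]]  # feature lookup by primary key
        predictions.append({key: row[key], "prediction": model(features)})
    return predictions


# Hypothetical stand-in for a feature table keyed by customer ID.
feature_table = {
    "c1": {"tenure": 2, "monthly_charges": 90.0},
    "c2": {"tenure": 48, "monthly_charges": 30.0},
}


def toy_model(features):
    # Made-up rule for illustration only: short tenure is scored as churn (1).
    return int(features["tenure"] < 12)


# The inference input carries only primary keys, as in the steps above.
predictions = score_batch([{"customer_id": "c1"}, {"customer_id": "c2"}],
                          feature_table, toy_model)
print(predictions)
```

In the real job the lookup and scoring are handled by the model logged through the Feature Store API, and the predictions are written to the configured Delta path rather than returned as a list.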
- Multitask jobs running against the same cluster
  - The pipeline initial-model-train-register is a multitask job which stitches together demo setup, feature store creation and model train pipelines.
  - At present, each of these tasks within the multitask job is executed on a different automated job cluster, rather than all tasks executed on the same cluster. As such, there will be time incurred for each task to acquire cluster resources and install dependencies.
  - As above, we recommend using a pool from which instances can be acquired when jobs are launched to reduce cluster start up time.
To use this project, you need Python 3.X and pip or conda for package management.
pip install -r unit-requirements.txt
pip install -e .
For unit testing, please use pytest:
pytest tests/unit --cov
Please check the directory tests/unit for more details on how to use unit tests.
In tests/unit/conftest.py you'll also find useful testing primitives, such as a local Spark instance with Delta support, local MLflow, and a DBUtils fixture.
There are two options for running integration tests:
- On an interactive cluster via dbx execute
- On a job cluster via dbx launch
For quicker startup of the job clusters we recommend using instance pools (AWS, Azure, GCP).
For an integration test on interactive cluster, use the following command:
dbx execute --cluster-name=<name of interactive cluster> --job=<name of the job to test>
For a test on an automated job cluster, deploy the job files and then launch:
dbx deploy --jobs=<name of the job to test> --files-only
dbx launch --job=<name of the job to test> --as-run-submit --trace
Please note that for testing we recommend using jobless deployments, so you won't affect existing job definitions.
- dbx expects that the cluster for interactive execution supports %pip and %conda magic commands.
- Please configure your job in the conf/deployment.yml file.
- To execute the code interactively, provide either --cluster-id or --cluster-name.
dbx execute \
--cluster-name="<some-cluster-name>" \
--job=job-name
Multiple users can also use the same cluster for development. Libraries are isolated per execution context.
To start working with your notebooks from Repos, do the following steps:
- Add your git provider token to your user settings
- Add your repository to Repos. This could be done via UI, or via CLI command below:
databricks repos create --url <your repo URL> --provider <your-provider>
This command will create your personal repository under /Repos/<username>/telco_churn.
- To set up the CI/CD pipeline with the notebook, create a separate Staging repo:
databricks repos create --url <your repo URL> --provider <your-provider> --path /Repos/Staging/telco_churn