
MLflow Pipelines Regression Template

This repository serves as a customizable template for the MLflow Regression Pipeline, which helps you develop high-quality, production-ready regression models.

Currently supported ML models are limited to scikit-learn and frameworks that integrate with scikit-learn, such as the XGBRegressor API from XGBoost.

Note: MLflow Pipelines is an experimental feature in MLflow. If you observe any issues, please report them here. For suggestions on improvements, please file a discussion topic here. Your contribution to MLflow Pipelines is greatly appreciated by the community!

Installation instructions

(Optional) For the best experience, create a clean Python environment via either virtualenv or conda. Python 3.7 or higher is required.

  1. Install the latest MLflow with Pipelines:
pip install mlflow[pipelines]
  2. Clone this MLflow Regression Pipeline template repository locally:
git clone https://github.com/mlflow/mlp-regression-template.git
  3. Enter the root directory of the cloned pipeline template:
cd mlp-regression-template
  4. Install the template dependencies:
pip install -r requirements.txt

Log to the designated MLflow Experiment

To log pipeline runs to a particular MLflow experiment:

  1. Open profiles/databricks.yaml or profiles/local.yaml, depending on your environment.
  2. Edit (and uncomment, if necessary) the experiment section, specifying the name of the desired experiment for logging.
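
For example, the experiment section of profiles/local.yaml might look like the following. This is a minimal sketch only: the experiment name is a placeholder, and the exact keys may differ in your copy of the template, so match the commented-out section already present in the profile.

experiment:
  # Placeholder name; replace with the experiment you want pipeline runs logged to.
  name: "sklearn_regression_experiment"
  # Tracking store used by the local profile; matches the URI passed to the MLflow UI command below.
  tracking_uri: "sqlite:///metadata/mlflow/mlruns.db"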

Development Environment -- Databricks

Sync this repository with Databricks Repos and run the notebooks/databricks notebook on a Databricks cluster running Databricks Runtime 11.0 or later (or the corresponding Databricks Runtime for Machine Learning), with workspace files support enabled.

Note: When making changes to pipelines on Databricks, it is recommended that you either edit files on your local machine and use dbx to sync them to Databricks Repos, as demonstrated here, or edit files in Databricks Repos by opening separate browser tabs for each YAML file or Python code module that you wish to modify.

For the latter approach, we recommend opening at least 3 browser tabs to facilitate easier development:

  • One tab for modifying configurations in pipeline.yaml and/or profiles/{profile}.yaml
  • One tab for modifying step function(s) defined in steps/{step}.py (see the sketch after this list)
  • One tab for modifying and running the driver notebook (notebooks/databricks)
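
To illustrate the kind of edit made in the second tab: the training step is driven by a small Python function defined in steps/train.py. The sketch below is illustrative only; it assumes the step expects an estimator_fn that returns an unfitted scikit-learn estimator, so check the function name and signature in your copy of the template before editing.

# steps/train.py -- illustrative sketch, not the template's exact contents
from sklearn.linear_model import SGDRegressor

def estimator_fn():
    """Return an unfitted estimator; the train step fits it on the transformed training data."""
    return SGDRegressor(random_state=42, max_iter=1000)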

Accessing MLflow Pipeline Runs

You can find MLflow Experiments and MLflow Runs created by the pipeline on the Databricks ML Experiments page.

Development Environment -- Local machine

Jupyter

  1. Launch the Jupyter Notebook environment via the jupyter notebook command.
  2. Open and run the notebooks/jupyter.ipynb notebook in the Jupyter environment.
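
The notebook drives the pipeline through MLflow's experimental Python API rather than the CLI. A minimal sketch of what such a notebook cell does, assuming the mlflow.pipelines.Pipeline class with run/inspect methods mirroring the CLI commands in the next section:

from mlflow.pipelines import Pipeline

# Load the pipeline defined by pipeline.yaml, using the "local" profile.
p = Pipeline(profile="local")

# Run a single step, or the whole pipeline when no step is given.
p.run(step="ingest")
p.run()

# Render the results card for a step.
p.inspect(step="train")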

Command-Line Interface (CLI)

First, enter the template root directory and set the profile via an environment variable:

cd mlp-regression-template
export MLFLOW_PIPELINES_PROFILE=local

Then, try running the following MLflow Pipelines CLI commands to get started. Note that the --step argument is optional. Pipeline commands without a --step specified act on the entire pipeline instead.

Available step names are: ingest, split, transform, train, evaluate and register.

  • Display the help message:
mlflow pipelines --help
  • Run a pipeline step or the entire pipeline:
mlflow pipelines run --step step_name
  • Inspect a step card or the pipeline dependency graph:
mlflow pipelines inspect --step step_name
  • Clean a step cache or all step caches:
mlflow pipelines clean --step step_name

Note: A shortcut to mlflow pipelines is installed as mlp. For example, to run the ingest step, instead of issuing mlflow pipelines run --step ingest, you may type:

mlp -s ingest

Accessing MLflow Pipeline Runs

To view MLflow Experiments and MLflow Runs created by the pipeline:

  1. Enter the template root directory: cd mlp-regression-template

  2. Start the MLflow UI:

mlflow ui \
   --backend-store-uri sqlite:///metadata/mlflow/mlruns.db \
   --default-artifact-root ./metadata/mlflow/mlartifacts \
   --host localhost
  3. Open a browser tab pointing to http://127.0.0.1:5000