This repository serves as a customizable template for the MLflow Regression Pipeline, used to develop high-quality, production-ready regression models. Currently supported ML models are limited to scikit-learn and frameworks that integrate with scikit-learn, such as the `XGBRegressor` API from XGBoost.
Note: MLflow Pipelines is an experimental feature in MLflow. If you observe any issues, please report them here. For suggestions on improvements, please file a discussion topic here. Your contribution to MLflow Pipelines is greatly appreciated by the community!
- (Optional) Create a clean Python environment via virtualenv or conda for the best experience. Python 3.7 or higher is required.
- Install the latest MLflow with Pipelines:
  ```
  pip install mlflow[pipelines]
  ```
- Clone this MLflow Regression Pipeline template repository locally:
  ```
  git clone https://github.com/mlflow/mlp-regression-template.git
  ```
- Enter the root directory of the cloned pipeline template:
  ```
  cd mlp-regression-template
  ```
- Install the template dependencies:
  ```
  pip install -r requirements.txt
  ```
To log pipeline runs to a particular MLflow experiment:
- Open `profiles/databricks.yaml` or `profiles/local.yaml`, depending on your environment.
- Edit (and uncomment, if necessary) the `experiment` section, specifying the name of the desired experiment for logging.
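For illustration, the `experiment` section of a profile might look like the following sketch. The experiment name and tracking URI shown here are placeholder values; use the keys already present in your profile file rather than these exact ones:

```yaml
experiment:
  # Placeholder values -- replace with your own experiment name and tracking URI
  name: "my_regression_experiment"
  tracking_uri: "sqlite:///metadata/mlflow/mlruns.db"
```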
Sync this repository with Databricks Repos and run the `notebooks/databricks` notebook on a Databricks cluster running version 11.0 or greater of the Databricks Runtime or the Databricks Runtime for Machine Learning with workspace files support enabled.
Note: When making changes to pipelines on Databricks, it is recommended that you either edit files on your local machine and use dbx to sync them to Databricks Repos, as demonstrated here, or edit files in Databricks Repos by opening separate browser tabs for each YAML file or Python code module that you wish to modify.
For the latter approach, we recommend opening at least 3 browser tabs to facilitate easier development:
- One tab for modifying configurations in `pipeline.yaml` and/or `profiles/{profile}.yaml`
- One tab for modifying step function(s) defined in `steps/{step}.py`
- One tab for modifying and running the driver notebook (`notebooks/databricks`)
You can find MLflow Experiments and MLflow Runs created by the pipeline on the Databricks ML Experiments page.
- Launch the Jupyter Notebook environment via the `jupyter notebook` command.
- Open and run the `notebooks/jupyter.ipynb` notebook in the Jupyter environment.
First, enter the template root directory and set the profile via environment variable:

```
cd mlp-regression-template
export MLFLOW_PIPELINES_PROFILE=local
```
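To confirm which profile is active in your current shell session, you can echo the variable back (a quick sanity check; `local` and `databricks` are the profiles shipped with this template):

```shell
# Select the execution profile for MLflow Pipelines via environment variable,
# then print it back to confirm the setting took effect in this shell.
export MLFLOW_PIPELINES_PROFILE=local
echo "Active profile: $MLFLOW_PIPELINES_PROFILE"   # prints: Active profile: local
```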
Then, try running the following MLflow Pipelines CLI commands to get started.
Note that the `--step` argument is optional. Pipeline commands without a `--step` specified act on the entire pipeline instead. Available step names are: `ingest`, `split`, `transform`, `train`, `evaluate`, and `register`.
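To make the step ordering concrete, the individual step commands could be issued in sequence as sketched below. The commands are echoed rather than executed so the ordering is easy to see; in practice, running `mlflow pipelines run` with no `--step` executes the whole pipeline end to end:

```shell
# Print the per-step commands in dependency order.
# Drop the 'echo' to actually execute each step.
for step in ingest split transform train evaluate register; do
  echo "mlflow pipelines run --step $step"
done
```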
- Display the help message:
  ```
  mlflow pipelines --help
  ```
- Run a pipeline step or the entire pipeline:
  ```
  mlflow pipelines run --step step_name
  ```
- Inspect a step card or the pipeline dependency graph:
  ```
  mlflow pipelines inspect --step step_name
  ```
- Clean a step cache or all step caches:
  ```
  mlflow pipelines clean --step step_name
  ```
Note: a shortcut to `mlflow pipelines` is installed as `mlp`. For example, to run the ingest step, instead of issuing `mlflow pipelines run --step ingest`, you may type:

```
mlp -s ingest
```
To view MLflow Experiments and MLflow Runs created by the pipeline:

- Enter the template root directory:
  ```
  cd mlp-regression-template
  ```
- Start the MLflow UI:
  ```
  mlflow ui \
    --backend-store-uri sqlite:///metadata/mlflow/mlruns.db \
    --default-artifact-root ./metadata/mlflow/mlartifacts \
    --host localhost
  ```
- Open a browser tab pointing to http://127.0.0.1:5000