mlflow-spark-summit-2019 - pyspark

Overview

  • PySpark Decision Tree Classification example
  • Source: train.py and predict.py
  • Experiment name: pyspark

Train

Unmanaged without mlflow run

To run the standard main function directly:

spark-submit --master local[2] train.py --max_depth 16 --max_bins 32
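The training script logs parameters and the fitted Spark ML model to MLflow. A minimal sketch of what train.py could look like is below; the dataset path, default parameter values, pipeline stages, and experiment name are assumptions for illustration, not taken from the actual script.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI matching the spark-submit example above;
    # the defaults here are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--max_bins", type=int, default=32)
    return parser.parse_args(argv)

def main():
    # Heavy imports are kept local so the CLI above can be used without Spark installed.
    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorIndexer
    from pyspark.ml.classification import DecisionTreeClassifier

    args = parse_args()
    spark = SparkSession.builder.appName("train").getOrCreate()
    # Assumed dataset path; the real example may load different data.
    data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
    train, test = data.randomSplit([0.7, 0.3])

    # Index labels and categorical features, matching the prediction schema shown below.
    label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    feature_indexer = VectorIndexer(
        inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    dt = DecisionTreeClassifier(
        labelCol="indexedLabel", featuresCol="indexedFeatures",
        maxDepth=args.max_depth, maxBins=args.max_bins)
    pipeline = Pipeline(stages=[label_indexer, feature_indexer, dt])

    mlflow.set_experiment("pyspark")  # ignored by `mlflow run`; see the note below
    with mlflow.start_run():
        mlflow.log_param("max_depth", args.max_depth)
        mlflow.log_param("max_bins", args.max_bins)
        model = pipeline.fit(train)
        mlflow.spark.log_model(model, "spark-model")

if __name__ == "__main__":
    main()
```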

Using mlflow run

These runs use the MLproject file. For more details see MLflow documentation - Running Projects.

Note that mlflow run ignores the set_experiment() function, so you must specify the experiment with the --experiment-id argument.
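An MLproject file declares the entry point and parameters that mlflow run passes through. A sketch of what it might contain is shown here; the project name, defaults, and command are illustrative assumptions, not the repository's actual file.

```yaml
name: pyspark-decision-tree    # assumed project name

entry_points:
  main:
    parameters:
      max_depth: {type: int, default: 5}    # assumed defaults
      max_bins: {type: int, default: 32}
    command: "spark-submit --master local[2] train.py
              --max_depth {max_depth} --max_bins {max_bins}"
```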

mlflow run local

mlflow run . -P max_depth=3 -P max_bins=24 --experiment-id=2019

mlflow run github

mlflow run https://github.com/amesar/mlflow-fun.git#examples/pyspark \
  -P max_depth=3 -P max_bins=24 \
  --experiment-id=2019

Predict

See predict.py.
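Prediction loads the model logged by the training run, using the run ID passed on the command line. A minimal sketch, assuming the model was logged under the artifact path "spark-model" and scoring the same assumed dataset as training:

```python
import sys

def model_uri(run_id):
    # MLflow's runs:/ URI scheme; "spark-model" is the assumed artifact
    # path used when the model was logged during training.
    return "runs:/{}/spark-model".format(run_id)

def main(run_id):
    # Local imports so the URI helper above is usable without Spark installed.
    import mlflow.spark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("predict").getOrCreate()
    model = mlflow.spark.load_model(model_uri(run_id))
    # Assumed dataset path, matching the training sketch.
    data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
    predictions = model.transform(data)
    predictions.printSchema()
    predictions.select("prediction", "indexedLabel", "probability").show(truncate=False)

if __name__ == "__main__":
    main(sys.argv[1])
```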

run_id=7b951173284249f7a3b27746450ac7b0
spark-submit --master local[2] predict.py $run_id
Predictions
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- indexedLabel: double (nullable = false)
 |-- indexedFeatures: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

+----------+------------+-----------+
|prediction|indexedLabel|probability|
+----------+------------+-----------+
|0.0       |1.0         |[1.0,0.0]  |
|1.0       |0.0         |[0.0,1.0]  |
|1.0       |0.0         |[0.0,1.0]  |
+----------+------------+-----------+