Kale

Convert a Jupyter Notebook to a Kubeflow Pipelines deployment.


⚠️ NOTICE ⚠️: The latest and greatest features are being developed in the kubecon-workshop branch, in preparation for this tutorial session at KubeCon North America on Nov 21st.

Note that we are targeting a seamless experience on MiniKF for the tutorial, so running Kale on local machines or on K8s clusters might result in unexpected behavior. We will work to merge all the new features into master and support all environments after the conference.

Kale Logo


Kale is a Python package that aims to automatically deploy a general-purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use of the specific KFP DSL.

The general idea of Kale is to automatically arrange the cells of a notebook and transform them into a unified, KFP-compliant pipeline. To do so, the user only has to decide which cells correspond to which pipeline step, by means of tags. In this way, a researcher can focus on building and testing their code locally, and then scale it up in a simple, organized and controlled way.

Getting started

Install Kale from PyPI and run it over one of the provided examples.

# install kale
pip install kubeflow-kale

# download a tagged example notebook
wget https://raw.githubusercontent.com/kubeflow-kale/examples/master/titanic-ml-dataset/titanic_dataset_ml.ipynb
# convert the notebook to a python script that defines a kfp pipeline
kale --nb titanic_dataset_ml.ipynb

This will generate kaggle-titanic.kfp.py, containing a runnable pipeline defined using the KFP Python DSL. Have a look at the code to get a feel for the magic Kale is performing under the hood.
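
If you have never used the KFP DSL, the generated file will look roughly like the sketch below. This is a simplified, hypothetical excerpt (the step names, base image and embedded step code are placeholders); the real kaggle-titanic.kfp.py also contains the code that marshals data between steps:

```python
# Hypothetical, simplified sketch of a Kale-generated pipeline file.
# The real output has one function per tagged block, plus data-marshalling code.
import kfp.dsl as dsl


def loaddata_op():
    # Each pipeline step becomes a container running the code of its cells.
    return dsl.ContainerOp(
        name='loaddata',
        image='docker.io/kubeflow-kale/launcher:latest',  # base image from the notebook metadata
        command=['python3', '-c', '<code of the cells tagged block:loaddata>'],
    )


def featureengineering_op():
    return dsl.ContainerOp(
        name='featureengineering',
        image='docker.io/kubeflow-kale/launcher:latest',
        command=['python3', '-c', '<code of the cells tagged block:featureengineering>'],
    )


@dsl.pipeline(
    name='kaggle-titanic',
    description='Pipeline auto-generated by Kale',
)
def auto_generated_pipeline():
    loaddata_task = loaddata_op()
    # prev: tags become execution dependencies between the ContainerOps.
    featureengineering_task = featureengineering_op().after(loaddata_task)
```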

Deploy the pipeline

If you are running Kale in a Kubeflow Notebook Server, you can add the --run_pipeline flag to convert and run the pipeline automatically:

kale --nb titanic_dataset_ml.ipynb --run_pipeline

This will convert the Notebook and start a new run. Switch over to the KFP UI under the Experiments tab to see the running pipeline.
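
Outside of a Notebook Server you can also compile and submit the generated pipeline yourself with the plain KFP SDK. The sketch below is generic KFP usage rather than a Kale feature, and the module and function names (kaggle_titanic_kfp, auto_generated_pipeline) are assumptions; check the generated file for the real ones:

```python
# Generic KFP sketch for compiling and submitting the generated pipeline yourself.
import kfp
import kfp.compiler as compiler

# Hypothetical: copy/rename kaggle-titanic.kfp.py so it is importable as a module.
from kaggle_titanic_kfp import auto_generated_pipeline

# Compile the pipeline function into a workflow package...
compiler.Compiler().compile(auto_generated_pipeline, 'titanic.pipeline.tar.gz')

# ...and submit it to the KFP API server (host defaults to the in-cluster service).
client = kfp.Client()
experiment = client.create_experiment('Titanic Experiment')
client.run_pipeline(experiment.id, 'titanic-run', 'titanic.pipeline.tar.gz')
```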

Jupyter UI

The best way to exploit the potential of Kale is to run JupyterLab with the Jupyter Kale extension installed.

Tagging language

Jupyter provides a tagging feature out of the box that lets you associate each cell with custom-defined tags.

The tags are used to tell Kale how to convert the notebook's code cells into an execution graph, by specifying the execution dependencies between the pipeline steps and which code cells to merge together.

The list of tags recognized by Kale:

| Tag | Description |
| --- | --- |
| `block:<block_name>` | Assign the current cell to a pipeline step |
| `prev:<block_name>` | Define a dependency of the current cell on other pipeline steps |
| `imports` | Code to be added at the beginning of every pipeline step. This is particularly useful for cells containing import statements |
| `functions` | Code to be added at the beginning of every pipeline step, but after import statements. This is particularly useful for functions or statements used in multiple pipeline steps |
| `parameters` | To be used in cells that contain just variable assignments of primitive types (`int`, `str`, `float`, `bool`). These variables will be used as pipeline parameters, with the assigned values as defaults |
| `skip` | Do not include the code of the cell in the pipeline |

Note: <block_name> must consist of lowercase alphanumeric characters or _, and cannot start with a digit (matching regex: ^[_a-z][_a-z0-9]*$).

Cell Merging

Multiple code cells performing a related task (e.g. some data processing) can be merged into a single pipeline step by tagging the first one with a block tag (e.g. block:data_processing) and leaving the cells below it untagged. Kale will merge any untagged cell into the first tagged cell above it, always skipping cells tagged with skip.

Cells can be merged even if they are not contiguous in the notebook, simply by tagging them with the same block name; the order of the cells in the notebook is preserved in the resulting pipeline step.

Notebook Tags

The cell tags can be added to the tags section of a code cell's metadata (see the nbformat documentation on cell metadata).
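
Tags are usually added through the Jupyter UI (or the Kale JupyterLab extension), but since they are plain cell metadata you can also set them programmatically. The following sketch uses nbformat; the cell indices and block names are made-up examples, and the untagged cell shows the merging behavior described above:

```python
# Sketch: set Kale tags directly in the cell metadata with nbformat.
import nbformat

nb = nbformat.read('titanic_dataset_ml.ipynb', as_version=4)

# Tag a code cell as the 'loaddata' step (cell indices are just examples).
nb.cells[1].metadata['tags'] = ['block:loaddata']

# Tag another cell as 'data_processing', depending on 'loaddata'.
nb.cells[2].metadata['tags'] = ['block:data_processing', 'prev:loaddata']

# Leave the next cell untagged: Kale merges it into 'data_processing',
# the first tagged cell above it.
nb.cells[3].metadata['tags'] = []

nbformat.write(nb, 'titanic_dataset_ml.ipynb')
```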

Notebook Metadata Spec

In order to deploy a pipeline to Kubeflow Pipelines, Kale needs some additional information, such as the names of the experiment and the pipeline, its description, volume mounts, and so on.

All this information can be embedded into the Notebook in the metadata section (see the nbformat spec). Kale expects an entry in the metadata section named kubeflow_notebook with the following spec:

| Key | Required | Description | Spec |
| --- | --- | --- | --- |
| `pipeline_experiment` | Yes | Name of the KFP Experiment | Free text |
| `pipeline_name` | Yes | Name of the pipeline | Alphanumeric characters or `-` |
| `pipeline_description` | No | Description of the pipeline | Free text |
| `volumes` | No | A list of volume specs | See the Volume spec below |
| `docker_image` | No | Base docker image for the pipeline steps | - |

Volume spec:

| Key | Required | Description | Spec |
| --- | --- | --- | --- |
| `type` | Yes | Type of Volume to be mounted | One of `pv`, `pvc`, `new_pvc` |
| `name` | Yes | Name of the existing or new resource | K8s-compliant resource name |
| `mount_point` | Yes | The mount point in the pipeline step filesystem | Valid Unix path |
| `size` | Yes (for the `pv` and `new_pvc` types) | Size of the Volume | Integer |
| `size_type` | Yes (when defining `size`) | Storage size unit | One of `Gi`, `Mi`, `Ki` |
| `snapshot` | Yes | When `true`: snapshot the Volume at the end of the pipeline | Bool |
| `snapshot_name` | Yes (when `snapshot` is `true`) | Name of the snapshot resource | K8s-compliant resource name |

A sample Notebook metadata:

"kubeflow_notebook": {
    "experiment_name": "Titanic Experiment",
    "pipeline_name": "ml-comparison",
    "pipeline_description": "ML Pipeline predicting survival score of passengers of Titanic",
    "docker_image": "docker.io/kubeflow-kale/launcher:latest",
    "volumes": [
        {
            "type": "new_pvc",
            "name": "titanic-data-pvc",
            "mount_point": "/data",
            "size": "1",
            "size_type": "Gi",
            "snapshot": true,
            "snapshot_name": "titanic-data-snapshot"
        }
    ]
}
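
This entry lives in the notebook-level metadata, next to entries such as kernelspec. You can add it by editing the .ipynb JSON directly, or programmatically, for example with nbformat as in this minimal sketch (values copied from the sample above):

```python
# Sketch: embed the kubeflow_notebook entry in the notebook-level metadata.
import nbformat

nb = nbformat.read('titanic_dataset_ml.ipynb', as_version=4)

nb.metadata['kubeflow_notebook'] = {
    'experiment_name': 'Titanic Experiment',
    'pipeline_name': 'ml-comparison',
    'pipeline_description': 'ML Pipeline predicting survival score of passengers of Titanic',
    'docker_image': 'docker.io/kubeflow-kale/launcher:latest',
    'volumes': [
        {
            'type': 'new_pvc',
            'name': 'titanic-data-pvc',
            'mount_point': '/data',
            'size': '1',
            'size_type': 'Gi',
            'snapshot': True,
            'snapshot_name': 'titanic-data-snapshot',
        }
    ],
}

nbformat.write(nb, 'titanic_dataset_ml.ipynb')
```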

Data passing

When a Notebook is split into separate execution steps (each pipeline step runs inside its own Docker container), the steps no longer share an in-memory state, so the data dependencies between them would normally prevent the pipeline from executing properly.

Kale provides seamless execution, without any user intervention, by detecting these data dependencies and marshalling the necessary data between the steps.
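
Conceptually, this marshalling amounts to saving every object a downstream step needs at the end of the step that produces it, and loading it back at the beginning of the step that consumes it, for example on a volume shared by the steps. The sketch below only illustrates the idea and is not Kale's actual implementation; the /marshal path and the helper names are made up:

```python
# Illustration of the data-passing idea, NOT Kale's actual implementation.
# Assumes a volume shared by the pipeline steps is mounted at /marshal.
import os
import pickle

MARSHAL_DIR = '/marshal'  # hypothetical shared mount point


def save_obj(obj, name):
    # Called at the end of the step that produces `obj`.
    with open(os.path.join(MARSHAL_DIR, name + '.pkl'), 'wb') as f:
        pickle.dump(obj, f)


def load_obj(name):
    # Called at the beginning of the step that consumes the object.
    with open(os.path.join(MARSHAL_DIR, name + '.pkl'), 'rb') as f:
        return pickle.load(f)


# The producing step "data_processing" would end with something like:
#     save_obj(train_df, 'train_df')
# and the dependent step "train_model" would start with:
#     train_df = load_obj('train_df')
```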

Contribute

Just clone the repository to your local machine and install the package in a virtual environment.

# Clone the repo to your local environment
git clone https://github.com/kubeflow-kale/kale
cd kale
# Install the package in your virtualenv
pip install -r requirements.txt

For a more detailed explanation of the internals of Kale, head over to Architecture.md