Kale is a Python package that aims to automatically deploy a general-purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use of the specific KFP DSL.
The general idea of Kale is to automatically arrange the cells included in a notebook and transform them into a unified KFP-compliant pipeline. To do so, the user is only required to decide which cells correspond to which pipeline step, by the use of tags. In this way, researchers can focus on building and testing their code locally, and then scale it in a simple, organized and controlled way.
Install Kale from PyPI and run it over one of the provided examples.
# install kale
pip install kubeflow-kale
# download a tagged example notebook
wget https://raw.githubusercontent.com/kubeflow-kale/examples/master/titanic-ml-dataset/titanic_dataset_ml.ipynb
# convert the notebook to a python script that defines a kfp pipeline
kale --nb titanic_dataset_ml.ipynb
This will generate kaggle-titanic.kfp.py, containing a runnable pipeline defined using the KFP Python DSL. Have a look at the code to get a feel for the magic Kale is performing under the hood.
In case you are running Kale in a Kubeflow Notebook Server, you can add the --run_pipeline flag to convert and run the pipeline automatically:
kale --nb titanic_dataset_ml.ipynb --run_pipeline
This will convert the Notebook and start a new run. Switch over to the KFP UI, under the Experiments tab, to see the running pipeline.
The best way to exploit the potential of Kale is to run JupyterLab with the Jupyter Kale extension installed.
Jupyter provides a tagging feature out of the box that lets you associate each cell with custom-defined tags.
The tags are used to tell Kale how to convert the notebook's code cells into an execution graph, by specifying the execution dependencies between the pipeline steps and which code cells to merge together.
The list of tags recognized by Kale:
Tag | Description |
---|---|
block:<block_name> | Assign the current cell to a pipeline step |
prev:<block_name> | Define a dependency of the current cell to other pipeline steps |
imports | Code to be added at the beginning of every pipeline step. This is particularly useful with cells containing import statements |
functions | Code to be added at the beginning of every pipeline step, but after import statements. This is particularly useful for functions or statements used in multiple pipeline steps |
parameters | To be used in cells that contain just variable assignment of primitive types (int, str, float, bool). These variables will be used as pipeline parameters, using the assigned values as defaults |
skip | Do not include the code of the cell in the pipeline |
Note: <block_name> must consist of lower case alphanumeric characters or _, and cannot start with a digit (matching regex: ^[_a-z][_a-z0-9]*$).
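As a quick check of the naming rule, the sketch below tests a few candidate names. It assumes the pattern is meant to accept names of any length (hence the `*` quantifier), and `is_valid_block_name` is a hypothetical helper written for illustration, not part of Kale's API.

```python
import re

# Assumed pattern: a lowercase letter or underscore, followed by any number
# of lowercase letters, digits or underscores.
BLOCK_NAME_RE = re.compile(r"^[_a-z][_a-z0-9]*$")


def is_valid_block_name(name: str) -> bool:
    """Return True if `name` satisfies the block-name rule (hypothetical helper)."""
    # fullmatch also rejects a trailing newline, which `$` alone would allow.
    return BLOCK_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["data_processing", "_load", "1st_step", "Train", "train-model"]:
        print(candidate, is_valid_block_name(candidate))
```

Names with upper case letters, hyphens or a leading digit are rejected, so a tag such as block:train-model would need to be renamed to block:train_model.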
Multiple code cells performing a related task (e.g. some data processing) can be merged into a single pipeline step by tagging the first one with a block tag (e.g. block:data_processing) and leaving the cells below it untagged. Kale will merge any untagged cell with the first tagged cell above it, always skipping cells tagged with the skip tag.
Cells can be merged even if they are not contiguous in the notebook, just by tagging them with the same block name. The order of the cells in the notebook is preserved in the resulting pipeline step.
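The merging rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kale's actual implementation: `group_cells` and the `(tags, source)` tuple representation are hypothetical, cells tagged `imports`, `functions` or `parameters` are left out, and an untagged cell with no tagged cell above it is simply dropped.

```python
# Tags this sketch does not model; cells carrying them are left out entirely.
UNHANDLED = {"imports", "functions", "parameters"}


def group_cells(cells):
    """Group notebook cells into pipeline steps (hypothetical helper).

    `cells` is a list of (tags, source) tuples in notebook order.
    Returns a dict mapping step name -> {"code": [...], "prev": [...]}.
    """
    steps = {}
    current = None  # step that untagged cells are merged into
    for tags, source in cells:
        if "skip" in tags or UNHANDLED.intersection(tags):
            continue  # assumption: these cells do not change the current step
        block = next((t.split(":", 1)[1] for t in tags if t.startswith("block:")), None)
        if block is not None:
            current = block
        if current is None:
            continue  # assumption: nothing to merge this cell into
        step = steps.setdefault(current, {"code": [], "prev": []})
        step["code"].append(source)  # notebook order is preserved
        for tag in tags:
            if tag.startswith("prev:"):
                dependency = tag.split(":", 1)[1]
                if dependency not in step["prev"]:
                    step["prev"].append(dependency)
    return steps


# Made-up cells: the last one rejoins load_data despite not being contiguous.
cells = [
    (["block:load_data"], "df = load()"),
    ([], "df = df.dropna()"),
    (["skip"], "df.head()"),
    (["block:train", "prev:load_data"], "model = fit(df)"),
    (["block:load_data"], "print(len(df))"),
]
steps = group_cells(cells)
for name, step in steps.items():
    print(name, step["prev"], step["code"])
```

Here the untagged second cell and the non-contiguous last cell both end up in the load_data step, while the skipped cell appears nowhere.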
Cell tags can be added to the tags section of a code cell's metadata (see the nbformat doc on cell metadata).
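In the notebook JSON, a cell's tags are stored as a list of strings under its `metadata.tags` key. The sketch below builds a minimal notebook as a plain Python dict and tags one cell using only the standard library; `add_tags` is a hypothetical helper and the cell contents are made up for illustration.

```python
import json


def add_tags(cell, *tags):
    """Append tags to a cell's metadata, creating the `tags` list if needed."""
    existing = cell.setdefault("metadata", {}).setdefault("tags", [])
    for tag in tags:
        if tag not in existing:  # avoid duplicate tags
            existing.append(tag)
    return cell


# A minimal nbformat-4-style notebook with a single code cell (illustrative).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["df = df.dropna()"],
        }
    ],
}

add_tags(notebook["cells"][0], "block:data_processing", "prev:load_data")
print(json.dumps(notebook["cells"][0]["metadata"], indent=2))
```

The same edit can be made by hand in JupyterLab's cell metadata editor; the script form is mainly useful for tagging many notebooks at once.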
To deploy a pipeline to Kubeflow Pipelines, Kale needs several pieces of information, such as the name of the experiment and the pipeline, its description, volume mounts, etc.
All this information can be embedded into the Notebook in the metadata section (see the nbformat spec). Kale expects an entry in the metadata section named kubeflow_notebook with the following spec:
Key | Required | Description | Spec |
---|---|---|---|
pipeline_experiment | Yes | Name of the KFP Experiment | Free Text |
pipeline_name | Yes | Name of the pipeline | Alphanumeric characters or - |
pipeline_description | No | Description of the pipeline | Free Text |
volumes | No | A list of volume specs | See below the Volume spec |
docker_image | No | Base docker image for pipeline steps | - |
Volume spec:
Key | Required | Description | Spec |
---|---|---|---|
type | Yes | Type of Volume to be mounted | One of pv, pvc, new_pvc |
name | Yes | Name of the existing or new resource | K8s compliant resource name |
mount_point | Yes | The mount point in the pipeline step fs | Valid Unix path |
size | Yes (for pv and new_pvc options) | Size of Volume | Integer |
size_type | Yes (when defining size) | Storage size | One of Gi, Mi, Ki |
snapshot | Yes | When true: snapshot volume at the end of pipeline | Bool |
snapshot_name | Yes (when snapshot True) | Name of the snapshot resource | K8s compliant resource name |
A sample Notebook metadata:
"kubeflow_notebook": {
  "experiment_name": "Titanic Experiment",
  "pipeline_name": "ml-comparison",
  "pipeline_description": "ML Pipeline predicting survival score of passengers of Titanic",
  "docker_image": "docker.io/kubeflow-kale/launcher:latest",
  "volumes": [
    {
      "type": "new_pvc",
      "name": "titanic-data-pvc",
      "mount_point": "/data",
      "size": "1",
      "size_type": "Gi",
      "snapshot": true,
      "snapshot_name": "titanic-data-snapshot"
    }
  ]
}
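Before handing a notebook to Kale, it can be handy to check a volume entry against the Volume spec table. The following is a minimal sketch of such a check; `validate_volume` is a hypothetical helper, and Kale's own validation may differ or be stricter.

```python
VOLUME_TYPES = {"pv", "pvc", "new_pvc"}
SIZE_TYPES = {"Gi", "Mi", "Ki"}


def validate_volume(volume):
    """Return a list of problems found in one volume spec (empty if none)."""
    errors = []
    # Keys the spec table marks as unconditionally required.
    for key in ("type", "name", "mount_point", "snapshot"):
        if key not in volume:
            errors.append(f"missing required key: {key}")
    vol_type = volume.get("type")
    if vol_type is not None and vol_type not in VOLUME_TYPES:
        errors.append(f"type must be one of {sorted(VOLUME_TYPES)}")
    # size is conditionally required; its value is not checked here.
    if vol_type in ("pv", "new_pvc") and "size" not in volume:
        errors.append(f"size is required for type {vol_type}")
    if "size" in volume and volume.get("size_type") not in SIZE_TYPES:
        errors.append(f"size_type must be one of {sorted(SIZE_TYPES)}")
    if volume.get("snapshot") is True and not volume.get("snapshot_name"):
        errors.append("snapshot_name is required when snapshot is true")
    return errors


# The volume entry from the sample metadata.
sample_volume = {
    "type": "new_pvc",
    "name": "titanic-data-pvc",
    "mount_point": "/data",
    "size": "1",
    "size_type": "Gi",
    "snapshot": True,
    "snapshot_name": "titanic-data-snapshot",
}
print(validate_volume(sample_volume))
```

Returning a list of messages rather than raising on the first problem lets all issues in a spec be reported in one pass.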
When a Notebook is split into separate execution steps, each step runs inside its own Docker container, so the data dependencies between steps would, left unhandled, prevent the pipeline from executing properly.
Kale provides seamless execution without any user intervention by detecting these data dependencies and marshalling the necessary data between the steps.
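To make the idea concrete, here is a minimal sketch of dependency detection and marshalling between two steps. It is not Kale's implementation: it assumes step code is available as plain source strings, uses the standard `ast` module to find names a step reads without assigning, and uses `pickle` files in a shared directory as the transport. All function names are hypothetical, and the detection is deliberately crude (imports, function definitions and read-before-write cases are not handled).

```python
import ast
import builtins
import os
import pickle
import tempfile


def assigned_names(source):
    """Names bound by assignment anywhere in `source`."""
    tree = ast.parse(source)
    return {n.id for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}


def missing_names(source):
    """Names that `source` reads but never assigns (builtins excluded)."""
    tree = ast.parse(source)
    loaded = {n.id for n in ast.walk(tree)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return loaded - assigned_names(source) - set(dir(builtins))


def run_step(source, inputs, outputs, data_dir):
    """Run `source` in a fresh namespace, loading `inputs` and saving `outputs`."""
    namespace = {}
    for name in inputs:
        with open(os.path.join(data_dir, name + ".pkl"), "rb") as f:
            namespace[name] = pickle.load(f)
    exec(source, namespace)
    for name in outputs:
        with open(os.path.join(data_dir, name + ".pkl"), "wb") as f:
            pickle.dump(namespace[name], f)
    return namespace


# Two made-up steps: step_b depends on variables that step_a creates.
step_a = "values = [1, 2, 3]\nscale = 10"
step_b = "total = sum(v * scale for v in values)"

# Variables that step_b needs and step_a produces must be passed along.
shared = sorted(missing_names(step_b) & assigned_names(step_a))

with tempfile.TemporaryDirectory() as data_dir:
    run_step(step_a, inputs=[], outputs=shared, data_dir=data_dir)
    result = run_step(step_b, inputs=shared, outputs=[], data_dir=data_dir)

print(shared, result["total"])
```

Each call to `run_step` starts from an empty namespace, which stands in for the isolation between containers; the temporary directory plays the role of storage that both steps can reach.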
Just clone the repository to your local machine and install the package in a virtual environment.
# Clone the repo to your local environment
git clone https://github.com/kubeflow-kale/kale
cd kale
# Install the package in your virtualenv
pip install -r requirements.txt
For a more detailed explanation of the internals of Kale, head over to Architecture.md