This application is a sample app created to demonstrate how to realize end-to-end reproducible machine learning on Kubernetes. It was referenced in a talk at KubeCon US 2019 in San Diego. Slides for the talk are available here.
At its core, this application solves a deep learning semantic segmentation problem using U-Net with a MobileNetV2 or VGG-19 convolutional network as the backbone. The model is inspired by this TensorFlow demo example but has been modified further for a number of other purposes.
The end-to-end example is tested with the Oxford-IIIT Pet Dataset, which segments pet images into 3 non-overlapping categories: a) Pet, b) Background and c) Unknown.
However, the code has also been used in multi-label scenarios.
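The difference between the two labelling schemes can be sketched in plain Python. This is an illustration only and is not taken from pylib: the class order, the function names and the 0.5 threshold are assumptions. In the non-overlapping case each pixel receives exactly one category (the highest-scoring one), whereas in the multi-label case each category is thresholded independently, so a pixel may carry several labels or none.

```python
# Illustrative only: class order and threshold are assumptions, not project settings.
CLASSES = ["pet", "background", "unknown"]


def single_label_mask(scores):
    """Assign each pixel the single class with the highest score.

    `scores` is a nested list shaped [height][width][num_classes].
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


def multi_label_mask(scores, threshold=0.5):
    """Assign each pixel every class whose score exceeds the threshold."""
    return [
        [[i for i, s in enumerate(pixel) if s > threshold] for pixel in row]
        for row in scores
    ]


# A 1x2 "image" with per-class scores for each pixel.
scores = [[[0.7, 0.6, 0.1], [0.2, 0.3, 0.4]]]

single = single_label_mask(scores)
multi = multi_label_mask(scores)

print([[CLASSES[c] for c in row] for row in single])
print(multi)
```

With these scores, the first pixel gets one label in the single-label mask but two in the multi-label mask, and the second pixel gets no label at all under the threshold.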
All the top-level Python scripts for this project are in app, whereas the library pylib wraps the core functionality. The Dockerfile can be found here. End to end can be used to run the whole workflow locally or in a container.
Reproducibility starts with the environment. The whole cluster, including the application runtime, needs to be version controlled. This app uses the GitOps concept to version the environment. To realize GitOps, it defines ArgoCD apps that can be set up on any Kubernetes cluster.
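As a rough sketch of what versioning the environment through ArgoCD can look like, the snippet below builds an ArgoCD `Application` manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The field names follow the ArgoCD `Application` resource, but the app name, repository URL, path, revision and namespace are hypothetical placeholders; the manifests this project actually uses live in cluster-conf and may differ.

```python
import json


def argocd_application(name, repo_url, path, revision, namespace):
    """Build an ArgoCD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                # Pinning a tag or commit (not a moving branch) is what
                # makes the installed environment reproducible.
                "targetRevision": revision,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


app = argocd_application(
    name="ml-platform",  # hypothetical app name
    repo_url="https://example.com/your-org/cluster-conf.git",  # placeholder URL
    path="apps/ml-platform",  # hypothetical path inside the repo
    revision="v1.0.0",  # hypothetical tag
    namespace="kubeflow",
)
print(json.dumps(app, indent=2))
```

Because the desired state is read from Git at a fixed revision, rolling the cluster back or recreating it elsewhere amounts to pointing ArgoCD at the same revision again.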
The ArgoCD App installs the following:
- Kubeflow 0.6.2
- Pachyderm 1.9.8
- Seldon 0.4.1
- Istio 1.1.0
- An ML-User RBAC to execute ML Operators defined by Kubeflow
As a result of the above installation, the following capabilities are present in the cluster:
- Jupyter Notebook via Kubeflow
- Training frameworks TFJob, TorchJob etc. via Kubeflow
- DAG pipelines: Kubeflow pipelines, Pachyderm, Argo
- Hyper Parameter Tuning: Katib, Ray
- Serving: Seldon, TFServe etc.
- Service Mesh: Istio
More information and specifics about configuring the infrastructure and all k8s-related runtime are located in cluster-conf. See the readme for more information.
The easiest way to set up is to use the Docker image suneetamall/e2e-ml-on-k8s from Docker Hub.
However, to create a local environment, follow the steps below.
This app is known to work with conda 4.7.11 and Python 3.7.3. The spec of the environment is detailed here and can be used to create a virtual environment as follows:
conda env create -f environment.yml
For more details on this, see here.
If using virtualenv, create and activate an environment:
virtualenv tf2 --python=python3.7.3
source tf2/bin/activate
The requirements are listed here, with pylib located in pylib.
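Since reproducibility depends on the interpreter as well as the packages, a small check like the following can confirm that the activated environment runs the Python version stated above. It is a minimal sketch and not part of this project; the function name is made up for illustration.

```python
import sys

# Version stated in this README; adjust if your environment spec differs.
EXPECTED_PYTHON = (3, 7, 3)


def version_mismatch(actual, expected=EXPECTED_PYTHON):
    """Return a warning string if `actual` differs from `expected`, else None."""
    actual = tuple(actual[:3])
    if actual == expected:
        return None
    return "expected Python {} but running {}".format(
        ".".join(map(str, expected)), ".".join(map(str, actual))
    )


message = version_mismatch(sys.version_info)
print(message or "Python version matches")
```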
See ml-workflow for information on the individual steps of the above workflow.
Finally, see the demo notebook.