Realizing End to End Reproducible Machine Learning on Kubernetes

End to End Reproducible Machine Learning on Kubernetes

This application is a sample app created to demonstrate how to realize end-to-end reproducible machine learning on Kubernetes. It was referenced at KubeCon US 2019 in San Diego. The slides are available here.

Model Network

At its core, this application solves a deep learning semantic segmentation problem using U-Net with a MobileNetV2 or VGG-19 convolutional network as the backbone. The model is inspired by this TensorFlow demo example, but has been modified further in several ways.
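As a rough illustration of the U-Net layout mentioned above, the sketch below computes the spatial sizes of the feature maps in a generic encoder/decoder with skip connections. It is not this project's model code: the function name `unet_feature_sizes`, the 128-pixel input, and the five downsampling stages are assumptions chosen for the example.

```python
def unet_feature_sizes(input_size, depth):
    """Return encoder sizes, (decoder, skip) size pairs, and the output size.

    Hypothetical helper: assumes square inputs, and that every encoder stage
    halves and every decoder stage doubles the spatial resolution.
    """
    if depth < 1 or input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")

    # Encoder (the backbone): each stage halves the resolution.
    encoder = [input_size // (2 ** level) for level in range(1, depth + 1)]

    # Skip connections come from every encoder stage except the bottleneck,
    # visited from the deepest stage to the shallowest.
    skips = encoder[-2::-1]

    # Decoder: each stage doubles the resolution and is concatenated with the
    # encoder feature map of the same size.
    decoder = []
    size = encoder[-1]
    for skip in skips:
        size *= 2
        decoder.append((size, skip))

    # A final upsampling step restores the input resolution.
    return encoder, decoder, size * 2


encoder, decoder, output_size = unet_feature_sizes(128, 5)
print(encoder)      # [64, 32, 16, 8, 4]
print(decoder)      # [(8, 8), (16, 16), (32, 32), (64, 64)]
print(output_size)  # 128
```

Each `(decoder, skip)` pair has matching sizes, which is what allows the decoder to concatenate the encoder's feature maps at every resolution.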

The Data

The end-to-end example is tested with the Oxford-IIIT Pet Dataset, which segments pet images into 3 non-overlapping categories: a) Pet, b) Background and c) Unknown.

Oxford Pet Dataset Sample

However, the code has also been used in multi-label scenarios.
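To make the three-category labelling concrete, here is a minimal sketch of turning a trimap into zero-based class indices and measuring how much of the image each class covers. It assumes the trimap encoding 1 = pet, 2 = background, 3 = unknown (matching the order listed above). The helper names and the toy 4x4 mask are invented for illustration and are not part of pylib.

```python
from collections import Counter

# Assumed trimap encoding: 1 = pet, 2 = background, 3 = unknown.
CLASS_NAMES = ("pet", "background", "unknown")


def to_class_indices(trimap):
    """Shift trimap values {1, 2, 3} to zero-based class indices {0, 1, 2}."""
    return [[value - 1 for value in row] for row in trimap]


def class_fractions(mask):
    """Return the fraction of pixels assigned to each class."""
    counts = Counter(value for row in mask for value in row)
    total = sum(counts.values())
    return {
        name: counts.get(index, 0) / total
        for index, name in enumerate(CLASS_NAMES)
    }


# Toy 4x4 trimap (made up): one pet pixel surrounded by unknown pixels.
toy_trimap = [
    [2, 2, 2, 2],
    [2, 3, 3, 2],
    [2, 3, 1, 3],
    [2, 2, 3, 2],
]

mask = to_class_indices(toy_trimap)
fractions = class_fractions(mask)
print(fractions)  # {'pet': 0.0625, 'background': 0.625, 'unknown': 0.3125}
```

Because the categories are non-overlapping, every pixel carries exactly one index and the fractions sum to 1. A multi-label setup would instead need one binary mask per class.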

Sample structure

All the top-level Python scripts for this project are in app, whereas the library pylib wraps the core functionality. The Dockerfile can be found here. End to end can be used to run this end to end locally or in a container.

Environment Setup

Kubernetes

Reproducibility starts with the environment. The whole cluster, including the application runtime, needs to be version controlled. This app uses the GitOps concept to version the environment. To realize GitOps, it defines ArgoCD apps that can be set up on any Kubernetes cluster.

ArgoCD

The ArgoCD App installs the following:

As a result of the above installation, the following capabilities are present in the cluster:

More information and specifics about configuring the infrastructure and all k8s-related runtime are located in cluster-conf. See the readme for more information.
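To illustrate the GitOps idea described in this section, the sketch below builds an ArgoCD `Application` manifest as a Python dictionary and serializes it to JSON. It is not taken from cluster-conf: the application name, repository URL, path, revision, and namespace are placeholders. The point it shows is that pinning `targetRevision` to a tag or commit ties the cluster state to a specific version in Git.

```python
import json


def argocd_application(name, repo_url, path, revision, dest_namespace):
    """Build an ArgoCD Application manifest (hypothetical helper)."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                # Pinning a tag or commit SHA keeps the environment reproducible.
                "targetRevision": revision,
            },
            "destination": {
                # In-cluster API server address used by ArgoCD.
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            # Automatically sync the cluster to match what is in Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# All values below are placeholders, not this project's real configuration.
manifest = argocd_application(
    name="ml-platform",
    repo_url="https://example.com/org/cluster-conf.git",
    path="apps",
    revision="v0.1.0",
    dest_namespace="ml",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Since kubectl accepts JSON as well as YAML, output like this could be saved to a file and applied with `kubectl apply -f`, assuming ArgoCD is already installed in the `argocd` namespace.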

Local Environment

The easiest way to set up is to use the Docker image suneetamall/e2e-ml-on-k8s from Docker Hub. However, to create a local environment, see:

Creating Python Environment:

This app was developed with conda 4.7.11 and Python 3.7.3. The environment spec is detailed here and can be used to create a virtual environment as follows:

    conda env create -f environment.yml

For more details on this, see here.

If using virtualenv,

    virtualenv tf2 --python=python3.7.3
    source tf2/bin/activate

Requirements are listed here, with pylib located in pylib.

Demo

Real machine learning workflow

See ml-workflow for information on the individual steps of the above workflow.

Finally, see the demo notebook.