Machine Learning Toolkit for Kubernetes

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Kubeflow

The Kubeflow project is dedicated to making machine learning on Kubernetes easy, portable, and scalable. Our goal is not to recreate other services, but to provide a straightforward way to spin up best-of-breed OSS solutions. This repository contains manifests for creating:

  • A JupyterHub to create & manage interactive Jupyter notebooks
  • A TensorFlow Training Controller that can be configured to use CPUs or GPUs and adjusted to the size of a cluster with a single setting
  • A TF Serving container

This document details the steps needed to run the Kubeflow project in any environment in which Kubernetes runs.

The Kubeflow Mission

Our goal is to help folks use ML more easily by letting Kubernetes do what it's great at:

  • Easy, repeatable, portable deployments on a diverse infrastructure (laptop <-> ML rig <-> training cluster <-> production cluster)
  • Deploying and managing loosely-coupled microservices
  • Scaling based on demand

Because ML practitioners use so many different types of tools, a key goal is that you can customize the stack to your requirements (within reason) and let the system take care of the "boring stuff." While we have started with a narrow set of technologies, we are working with many different projects to include additional tooling.

Ultimately, we want to have a set of simple manifests that give you an easy-to-use ML stack anywhere Kubernetes is already running, and that can self-configure based on the cluster it deploys into.

Who should consider using Kubeflow?

Based on the current functionality, you should consider using Kubeflow if:

  • You want to train/serve TensorFlow models in different environments (e.g. local, on prem, and cloud)
  • You want to use Jupyter notebooks to manage TensorFlow training jobs
    • Kubeflow is particularly helpful if you want to launch training jobs that use more resources (more nodes or more GPUs) than your notebook.
  • You want to combine TensorFlow with other processes
    • For example, if you want to use tensorflow/agents to run simulations that generate data for training reinforcement learning models

This list is based ONLY on current capabilities. We are investing significant resources to expand the functionality and are actively soliciting help from companies and individuals interested in contributing (see below).

Setup

This documentation assumes you have a Kubernetes cluster already available.

If you need help setting up a Kubernetes cluster, please refer to Kubernetes Setup.

If you want to use GPUs, be sure to follow the Kubernetes instructions for enabling GPUs.
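
Before moving on, it can help to confirm that the command-line tools used in the Quick Start (`ks` and `kubectl`) are on your PATH. The following is a minimal sketch rather than an official check; the function name `missing_tools` is hypothetical, and it only tests that the tools can be found, not that the cluster is reachable.

```shell
# Print the name of every tool in "$@" that is not found on PATH.
# (missing_tools is an illustrative helper, not part of Kubeflow.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the two tools that the Quick Start commands call.
missing="$(missing_tools ks kubectl)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing >&2
else
  echo "ks and kubectl found"
fi
```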

Quick Start

Requirements

Steps

To quickly set up all components, execute the following commands:

# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}

# Install Kubeflow components
ks registry add kubeflow github.com/google/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job

# Deploy Kubeflow
# Namespace to deploy into (assumed here to be "default"; adjust as needed)
NAMESPACE=default
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks apply default -c kubeflow-core

The commands above set up JupyterHub and a custom resource for running TensorFlow training jobs. Furthermore, the ksonnet packages provide prototypes that can be used to configure TensorFlow jobs and deploy TensorFlow models. Used together, these make it easy for a user to go from training to serving with TensorFlow, with minimal effort and in a way that is portable between different environments.

For more detailed instructions about how to use Kubeflow, please refer to the user guide.
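
One way to check the result of the deployment is to list the pods in the target namespace and the custom resource definitions registered in the cluster. The sketch below is an assumption-laden convenience, not part of Kubeflow: the function name `kubeflow_status` and the `KUBECTL` override variable are hypothetical, and the exact pod and resource names you will see are not specified here.

```shell
# Command used to talk to the cluster; overridable for dry runs
# (KUBECTL is a hypothetical convenience variable, not a kubectl feature).
KUBECTL="${KUBECTL:-kubectl}"

# List the pods in the given namespace (default: "default") and the
# custom resource definitions registered in the cluster.
kubeflow_status() {
  ns="${1:-default}"
  "$KUBECTL" get pods --namespace="$ns"
  "$KUBECTL" get customresourcedefinitions
}
```

Call it as `kubeflow_status <namespace>` with the namespace you deployed into. Setting `KUBECTL=echo` first prints the two commands instead of running them.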

Troubleshooting

Minikube

On Minikube, the VirtualBox and VMware drivers are recommended because there is a known issue between the KVM/KVM2 driver and TensorFlow Serving. The issue is tracked in kubernetes/minikube#2377.

RBAC clusters

If you are running on a K8s cluster with RBAC enabled, you may get an error like the following when deploying Kubeflow:

ERROR Error updating roles kubeflow-test-infra.jupyter-role: roles.rbac.authorization.k8s.io "jupyter-role" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["*"], APIGroups:["*"], Verbs:["*"]}] user=&{your-user@acme.com  [system:authenticated] map[]} ownerrules=[PolicyRule{Resources:["selfsubjectaccessreviews"], APIGroups:["authorization.k8s.io"], Verbs:["create"]} PolicyRule{NonResourceURLs:["/api" "/api/*" "/apis" "/apis/*" "/healthz" "/swagger-2.0.0.pb-v1" "/swagger.json" "/swaggerapi" "/swaggerapi/*" "/version"], Verbs:["get"]}] ruleResolutionErrors=[]

This error indicates that you do not have sufficient permissions. In many cases you can resolve this by creating an appropriate cluster role binding, as shown here, and then redeploying Kubeflow:

kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=your-user@acme.com
  • Replace your-user@acme.com with the user listed in the error message.
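
If you would rather not copy the user name by hand, it can be pulled out of the error text. This is a minimal sketch that assumes the message contains a `user=&{NAME ...}` fragment like the one in the error above; the function name `rbac_error_user` is hypothetical, and the binding command is only printed for review, not executed.

```shell
# Extract the user name from an RBAC "forbidden" error message.
# Assumes the message contains a fragment of the form: user=&{NAME ...}
rbac_error_user() {
  printf '%s\n' "$1" | sed -n 's/.*user=&{\([^ ]*\) .*/\1/p'
}

# Shortened copy of the error message, for illustration only.
err='... is forbidden: attempt to grant extra privileges: [...] user=&{your-user@acme.com  [system:authenticated] map[]} ownerrules=[...]'

user="$(rbac_error_user "$err")"
# Print the binding command for review instead of running it.
echo "kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=${user}"
```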

If you're using GKE, you may want to refer to GKE's RBAC docs to understand how RBAC interacts with IAM on GCP.

Resources

Get involved

Who should consider contributing to Kubeflow?

  • Folks who want to add support for other ML frameworks (e.g. PyTorch, XGBoost)
  • Folks who want to bring more Kubernetes magic to ML (e.g. Istio integration for prediction)
  • Folks who want to make Kubeflow a richer ML platform (e.g. support for ML pipelines, hyperparameter tuning)
  • Folks who want to tune Kubeflow for their particular Kubernetes distribution or Cloud
  • Folks who want to write tutorials/blog posts showing how to use Kubeflow to solve ML problems