Katib is a Kubernetes-native system for hyperparameter tuning and neural architecture search. It is inspired by Google Vizier and supports multiple ML/DL frameworks (e.g. TensorFlow, MXNet, and PyTorch).
Table of Contents
- Name
- Concepts in Katib
- Components in Katib
- Getting Started
- Web UI
- API Documentation
- Quickstart to run tfjob and pytorch operator jobs in Katib
- CONTRIBUTING
Katib stands for secretary in Arabic. As Vizier stands for a high official or a prime minister in Arabic, this project is named Katib in honor of Vizier.
Katib has the concepts of Experiment, Trial, Job and Suggestion.
An Experiment represents a single optimization run over a feasible space. Each Experiment contains a configuration describing the feasible space, as well as a set of Trials. It is assumed that the objective function f(x) does not change in the course of an Experiment.
In v1alpha1, Experiment is defined as a CRD StudyJob in Kubernetes.
In v1alpha2, Experiment is defined as a CRD Experiment.
A Trial is a list of parameter values, x, that will lead to a single evaluation of f(x). A Trial can be “Completed”, which means that it has been evaluated and the objective value f(x) has been assigned to it, otherwise it is “Pending”.
In v1alpha1, Trial is just a concept inside Katib and is not exposed to users.
In v1alpha2, Trial is defined as a CRD Trial in Kubernetes.
A Job refers to a process responsible for evaluating a Pending Trial and calculating its objective value.
The job kind can be Kubernetes Job, Kubeflow TFJob or Kubeflow PyTorchJob. Thus Katib supports multiple frameworks with the help of different job kinds.
A Suggestion is an algorithm that constructs a parameter set according to the Experiment. A Trial is then created to evaluate the parameter set.
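The relationship between these four concepts can be sketched in a few lines of Python. This is a toy illustration only, not Katib's implementation: the class names mirror the concepts, while `suggest_random`, `run_job`, `toy_objective`, and the dictionary form of the feasible space are hypothetical stand-ins.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Feasible space: parameter name -> (min, max). Hypothetical representation.
FeasibleSpace = Dict[str, Tuple[float, float]]


@dataclass
class Trial:
    """One parameter assignment x and, once evaluated, its objective f(x)."""
    parameters: Dict[str, float]
    objective: Optional[float] = None

    @property
    def status(self) -> str:
        return "Completed" if self.objective is not None else "Pending"


@dataclass
class Experiment:
    """A single optimization run: a feasible space plus its Trials."""
    feasible_space: FeasibleSpace
    trials: List[Trial] = field(default_factory=list)


def suggest_random(experiment: Experiment, rng: random.Random) -> Trial:
    """Suggestion: build a parameter set from the Experiment's feasible space."""
    params = {name: rng.uniform(lo, hi)
              for name, (lo, hi) in experiment.feasible_space.items()}
    trial = Trial(params)
    experiment.trials.append(trial)
    return trial


def run_job(trial: Trial, f: Callable[[Dict[str, float]], float]) -> None:
    """Job: evaluate a Pending Trial and record its objective value."""
    trial.objective = f(trial.parameters)


def toy_objective(x: Dict[str, float]) -> float:
    # Hypothetical objective; in Katib this would be a training job's metric.
    return -(x["lr"] - 0.03) ** 2


experiment = Experiment({"lr": (0.01, 0.05), "momentum": (0.5, 0.9)})
rng = random.Random(0)
for _ in range(3):
    pending = suggest_random(experiment, rng)  # Trial starts as Pending
    run_job(pending, toy_objective)            # Trial becomes Completed

best = max(experiment.trials, key=lambda t: t.objective)
```

Because the objective function is fixed for the whole Experiment, the objective values of all Completed Trials are directly comparable, which is what makes picking `best` meaningful.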
Currently Katib supports the following exploration algorithms in v1alpha1:
- random search
- grid search
- hyperband
- bayesian optimization
- NAS based on reinforcement learning
- NAS based on EnvelopeNets
And Katib supports the following exploration algorithms in v1alpha2:
- random search
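Of the algorithms listed, random search and grid search are the simplest: random search samples the feasible space, whereas grid search enumerates a fixed lattice of points over it. The following is a minimal sketch of the grid case, assuming each parameter range is discretized into a hypothetical `steps` number of evenly spaced values. It does not reflect how the suggestion-grid service is actually implemented.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple


def linspace(lo: float, hi: float, steps: int) -> List[float]:
    """Return `steps` evenly spaced values from lo to hi inclusive."""
    if steps < 2:
        return [lo]
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def grid_suggestions(space: Dict[str, Tuple[float, float]],
                     steps: int) -> Iterator[Dict[str, float]]:
    """Yield every combination of grid values, one parameter set per Trial."""
    names = list(space)
    axes = [linspace(lo, hi, steps) for lo, hi in space.values()]
    for values in product(*axes):
        yield dict(zip(names, values))


# Illustrative feasible space with two double-valued parameters.
space = {"--lr": (0.01, 0.05), "--momentum": (0.5, 0.9)}
grid = list(grid_suggestions(space, steps=3))
```

The number of parameter sets grows as `steps` raised to the number of parameters, which is one reason sampling-based algorithms are often preferred for larger spaces.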
In v1alpha1, Katib consists of several components, as shown below. Each component runs on Kubernetes as a deployment.
The components communicate with each other via gRPC, and the API is defined in pkg/api/v1alpha1/api.proto.
- vizier: main components.
  - vizier-core: gRPC API server of vizier.
  - vizier-core-rest: REST API server of vizier.
  - vizier-db: Data storage backend of vizier.
- suggestion: implementation of each exploration algorithm.
  - suggestion-random
  - suggestion-grid
  - suggestion-hyperband
  - suggestion-bayesianoptimization
  - suggestion-nasrl
  - suggestion-nasenvelopenets
- studyjob-controller: Controller for StudyJob CRD in Kubernetes.
- modeldb: WebUI
  - modeldb-frontend
  - modeldb-backend
  - modeldb-db
In v1alpha2, Katib consists of several components, as shown below. Each component runs on Kubernetes as a deployment.
The components communicate with each other via gRPC, and the API is defined in pkg/api/v1alpha2/api.proto.
- katib: main components.
  - katib-manager: gRPC API server of katib.
  - katib-manager-rest: REST API server of katib.
  - katib-db: Data storage backend of katib.
  - katib-ui: User interface of katib.
- suggestion: implementation of each exploration algorithm.
  - suggestion-random
- katib-controller: Controller for katib CRDs in Kubernetes.
  - experiment-controller: Controller for Experiment CRD in Kubernetes.
  - trial-controller: Controller for Trial CRD in Kubernetes.
Please see here for more details about katib v1alpha1.
Please see here for more details about katib v1alpha2.
Katib provides a Web UI. You can visualize the general trend of the hyperparameter space and the history of each training run. You can use random-example or other examples to generate a similar UI.
Please refer to api.md.
To run tfjob and pytorch operator jobs in Katib, you have to install their packages.
In your Ksonnet app root, run the following:
export KF_ENV=default
ks env set ${KF_ENV} --namespace=kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
To install the tfjob operator, run the following:
ks pkg install kubeflow/tf-training
ks pkg install kubeflow/common
ks generate tf-job-operator tf-job-operator
ks apply ${KF_ENV} -c tf-job-operator
To install the pytorch operator, run the following:
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${KF_ENV} -c pytorch-operator
Finally, you can install Katib:
ks pkg install kubeflow/katib
ks generate katib katib
ks apply ${KF_ENV} -c katib
If you want to use Katib outside GKE and your cluster has no StorageClass for dynamic volume provisioning, you have to create a persistent volume to bind your persistent volume claim.
This is the YAML file for the persistent volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
Create this PV after deploying the Katib package:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
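Whether this manual step is needed depends on whether the cluster has a default StorageClass. As a rough sketch, the check below inspects the JSON that `kubectl get storageclass -o json` prints and looks for the `storageclass.kubernetes.io/is-default-class` annotation (or its older beta form). The function name `default_storage_classes` and the sample payload are illustrative only and are not part of Katib.

```python
import json
from typing import Any, Dict, List

# Annotations Kubernetes uses to mark a StorageClass as the cluster default.
DEFAULT_ANNOTATIONS = (
    "storageclass.kubernetes.io/is-default-class",
    "storageclass.beta.kubernetes.io/is-default-class",
)


def default_storage_classes(storageclass_list: Dict[str, Any]) -> List[str]:
    """Return the names of StorageClasses annotated as default."""
    names = []
    for item in storageclass_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if any(annotations.get(key) == "true" for key in DEFAULT_ANNOTATIONS):
            names.append(metadata.get("name", ""))
    return names


# Hypothetical sample of `kubectl get storageclass -o json` output.
sample = json.loads("""
{
  "kind": "List",
  "items": [
    {"metadata": {"name": "standard",
                  "annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}},
    {"metadata": {"name": "slow"}}
  ]
}
""")

# With no default class, the PV above would have to be created by hand.
needs_manual_pv = not default_storage_classes(sample)
```

In practice the input could come from running `kubectl get storageclass -o json` through `subprocess.run` and parsing its standard output with `json.loads`.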
After deploying everything, you can run the examples.
To run the tfjob operator example, you have to create a volume for it.
If you are using GKE and the default StorageClass, you have to create this PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
If you are not using GKE and your cluster has no StorageClass for dynamic volume provisioning, you have to create both a PVC and a PV:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfevent-volume/tfevent-pvc.yaml
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfevent-volume/tfevent-pv.yaml
This is an example for the tfjob operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfjob-example.yaml
This is an example for the pytorch operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/pytorchjob-example.yaml
You can check the status of the StudyJob:
$ kubectl describe studyjob pytorchjob-example -n kubeflow
Name: pytorchjob-example
Namespace: kubeflow
Labels: controller-tools.k8s.io=1.0
Annotations: <none>
API Version: kubeflow.org/v1alpha1
Kind: StudyJob
Metadata:
Cluster Name:
Creation Timestamp: 2019-01-15T18:35:20Z
Generation: 1
Resource Version: 1058135
Self Link: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/pytorchjob-example
UID: 4fc7ad83-18f4-11e9-a6de-42010a8e0225
Spec:
Metricsnames:
accuracy
Objectivevaluename: accuracy
Optimizationgoal: 0.99
Optimizationtype: maximize
Owner: crd
Parameterconfigs:
Feasible:
Max: 0.05
Min: 0.01
Name: --lr
Parametertype: double
Feasible:
Max: 0.9
Min: 0.5
Name: --momentum
Parametertype: double
Requestcount: 4
Study Name: pytorchjob-example
Suggestion Spec:
Request Number: 3
Suggestion Algorithm: random
Suggestion Parameters:
Name: SuggestionCount
Value: 0
Worker Spec:
Go Template:
Raw Template: apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
name: {{.WorkerID}}
namespace: kubeflow
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: gcr.io/kubeflow-ci/pytorch-mnist-with-summary:1.0
imagePullPolicy: Always
command:
- "python"
- "/opt/pytorch_dist_mnist/dist_mnist_with_summary.py"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: gcr.io/kubeflow-ci/pytorch-mnist-with-summary:1.0
imagePullPolicy: Always
command:
- "python"
- "/opt/pytorch_dist_mnist/dist_mnist_with_summary.py"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
Retain: true
Status:
Conditon: Running
Early Stopping Parameter Id:
Last Reconcile Time: 2019-01-15T18:35:20Z
Start Time: 2019-01-15T18:35:20Z
Studyid: k291b444a0b68631
Suggestion Count: 1
Suggestion Parameter Id: n6f17dd9ff466a2b
Trials:
Trialid: o104235328003ad9
Workeridlist:
Completion Time: <nil>
Conditon: Running
Kind: PyTorchJob
Start Time: 2019-01-15T18:35:20Z
Workerid: b3b371c89144727f
Trialid: ca207b2432231de3
Workeridlist:
Completion Time: <nil>
Conditon: Running
Kind: PyTorchJob
Start Time: 2019-01-15T18:35:20Z
Workerid: f291b04fb27ece3c
Trialid: ddff69212e826432
Workeridlist:
Completion Time: <nil>
Conditon: Running
Kind: PyTorchJob
Start Time: 2019-01-15T18:35:20Z
Workerid: ncbed67bbcd4a8ed
Events: <none>
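The `{{- with .HyperParameters}}` and `{{- range .}}` section of the worker template above appends one `name=value` argument per suggested hyperparameter to the container command. The effect can be mimicked in Python as below. This is only an illustration of the resulting command line, not Katib's Go template rendering; `render_command` is a hypothetical helper and the sample values are made up.

```python
from typing import Dict, List

# Fixed part of the container command, copied from the worker template.
BASE_COMMAND = ["python", "/opt/pytorch_dist_mnist/dist_mnist_with_summary.py"]


def render_command(base: List[str], hyperparameters: Dict[str, float]) -> List[str]:
    """Append one "name=value" argument per hyperparameter, in order."""
    return base + [f"{name}={value}" for name, value in hyperparameters.items()]


# Hypothetical suggestion within the feasible ranges shown in the spec.
suggested = {"--lr": 0.03, "--momentum": 0.7}
command = render_command(BASE_COMMAND, suggested)
```

Since the parameter names in the spec already carry the leading dashes (`--lr`, `--momentum`), the rendered entries can be passed to the training script as ordinary command-line flags.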
When the spec.Status.Condition becomes Completed, the StudyJob is finished.
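To wait for completion from a script, one option is to poll `kubectl describe studyjob` and read the study-level condition from its output. The helper below is a sketch under that assumption: it takes the describe text as input, looks at the first condition line after `Status:`, and accepts both the `Conditon` spelling seen in the output above and `Condition`. The function name `study_condition` and the shortened sample text are hypothetical.

```python
import re
from typing import Optional

# Matches "Conditon: Running" (as printed above) or "Condition: Running".
_CONDITION_RE = re.compile(r"^\s*Conditi?on:\s*(\S+)")


def study_condition(describe_output: str) -> Optional[str]:
    """Return the first condition listed under the top-level Status section."""
    in_status = False
    for line in describe_output.splitlines():
        if line.startswith("Status:"):
            in_status = True
            continue
        if in_status:
            match = _CONDITION_RE.match(line)
            if match:
                return match.group(1)
    return None


# Shortened, hypothetical excerpt of describe output for illustration.
sample = """\
Spec:
  Optimizationtype:  maximize
Status:
  Conditon:  Running
  Trials:
    Trialid:  o104235328003ad9
    Workeridlist:
      Conditon:  Completed
"""

finished = study_condition(sample) == "Completed"
```

Only the first match after `Status:` is used, so the per-worker conditions listed under `Trials` do not affect the result.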
You can monitor your results in the Katib UI. To access the Katib UI, you have to install Ambassador.
In your Ksonnet app root, run the following:
ks generate ambassador ambassador
ks apply ${KF_ENV} -c ambassador
After this, you have to port-forward the Ambassador service:
kubectl port-forward svc/ambassador -n kubeflow 8080:80
Finally, you can access the Katib UI at this URL: http://localhost:8080/katib/
Delete installed components
ks delete ${KF_ENV} -c katib
ks delete ${KF_ENV} -c pytorch-operator
ks delete ${KF_ENV} -c tf-job-operator
If you created a PV for Katib, delete it:
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
If you deployed Ambassador, delete it:
ks delete ${KF_ENV} -c ambassador
Please feel free to test the system! developer-guide.md is a good starting point for developers.