Katib is a Kubernetes-based system for Hyperparameter Tuning and Neural Architecture Search. Katib supports a number of ML frameworks, including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.
- Getting Started
- Name
- Concepts in Katib
- Components in Katib
- Web UI
- API documentation
- Installation
- Katib SDK
- Quick Start
- Who are using Katib?
- Citation
- CONTRIBUTING
See the getting-started guide on the Kubeflow website.
Katib stands for "secretary" in Arabic.
For a detailed description of the concepts in Katib, hyperparameter tuning, and neural architecture search, see the Kubeflow documentation.
Katib has the concepts of Experiment, Trial, Job and Suggestion.
An Experiment represents a single optimization run over a feasible space.
Each Experiment contains a configuration:
- Objective: What we are trying to optimize.
- Search Space: Constraints for configurations describing the feasible space.
- Search Algorithm: How to find the optimal configurations.
Experiment is defined as a CRD. See the detailed guide to configuring and running a Katib experiment in the Kubeflow docs.
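As an illustration of how the three parts of the configuration fit together, here is a minimal Python sketch that assembles an Experiment-shaped dictionary. The helper `build_experiment` is hypothetical. The camelCase field names are an assumption inferred from the v1beta1 `kubectl describe` output shown later in this document, and the trial template is deliberately left out, so the result is not a complete manifest.

```python
import json


def build_experiment(name, namespace, metric, goal, parameters,
                     algorithm="random", max_trials=12,
                     parallel_trials=3, max_failed_trials=3):
    """Return a dict shaped like a v1beta1 Experiment (trialTemplate omitted).

    `parameters` is a list of (name, parameter_type, min, max) tuples.
    Field names are assumptions, not taken from the API reference.
    """
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Objective: what we are trying to optimize.
            "objective": {
                "type": "maximize",
                "goal": goal,
                "objectiveMetricName": metric,
            },
            # Search Algorithm: how to find the optimal configuration.
            "algorithm": {"algorithmName": algorithm},
            "parallelTrialCount": parallel_trials,
            "maxTrialCount": max_trials,
            "maxFailedTrialCount": max_failed_trials,
            # Search Space: constraints describing the feasible space.
            "parameters": [
                {
                    "name": p_name,
                    "parameterType": p_type,
                    "feasibleSpace": {"min": str(lo), "max": str(hi)},
                }
                for p_name, p_type, lo, hi in parameters
            ],
        },
    }


experiment = build_experiment(
    name="tfjob-example",
    namespace="kubeflow",
    metric="accuracy_1",
    goal=0.99,
    parameters=[
        ("learning_rate", "double", 0.01, 0.05),
        ("batch_size", "int", 100, 200),
    ],
)
print(json.dumps(experiment, indent=2))
```

The values mirror the tfjob-example Experiment used in the verification steps; a real Experiment also needs a trial template describing the worker job.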
A Suggestion is a proposed solution to the optimization problem: one set of hyperparameter values, expressed as a list of parameter assignments. A Trial is then created to evaluate those parameter assignments.
Suggestion is defined as a CRD.
A Trial is one iteration of the optimization process: one worker job instance with a list of parameter assignments (corresponding to a Suggestion).
Trial is defined as a CRD.
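To make the link between a Suggestion and a Trial concrete, the sketch below substitutes one set of parameter assignments into a worker command that uses the `${trialParameters.<name>}` placeholders visible in the tfjob-example output later in this document. The function `render_command` is hypothetical and only illustrates the idea; it is not Katib's own templating code.

```python
import re

# Matches placeholders such as ${trialParameters.learningRate}.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render_command(command, trial_parameters, assignments):
    """Fill trial parameter placeholders in a command.

    command          -- list of argument strings containing placeholders
    trial_parameters -- list of {"name": ..., "reference": ...} dicts, where
                        "reference" names a parameter in the search space
    assignments      -- dict mapping search-space parameter names to values
    """
    values = {
        p["name"]: str(assignments[p["reference"]]) for p in trial_parameters
    }
    return [
        PLACEHOLDER.sub(lambda match: values[match.group(1)], arg)
        for arg in command
    ]


trial_parameters = [
    {"name": "learningRate", "reference": "learning_rate"},
    {"name": "batchSize", "reference": "batch_size"},
]
command = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
]
# One suggested set of parameter assignments (illustrative values).
suggestion = {"learning_rate": "0.02", "batch_size": "150"}

rendered = render_command(command, trial_parameters, suggestion)
print(rendered)
```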
A Worker Job refers to a process responsible for evaluating a Trial and calculating its objective value.
The worker kind can be a Kubernetes Job, which is a non-distributed execution, or a Kubeflow TFJob or Kubeflow PyTorchJob, which are distributed executions. Thus, Katib supports multiple frameworks with the help of different job kinds.
Currently Katib supports the following exploration algorithms:
- Random Search
- Tree of Parzen Estimators (TPE)
- Grid Search
- Hyperband
- Bayesian Optimization
- CMA Evolution Strategy
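For intuition about the simplest of these algorithms, here is a small random search sketch over a feasible space like the one in the tfjob-example Experiment. This is not Katib's implementation: `sample`, `random_search`, and `toy_objective` are hypothetical names, and the toy objective stands in for a real training job.

```python
import random


def sample(space, rng):
    """Draw one parameter assignment uniformly from the feasible space."""
    assignment = {}
    for param in space:
        if param["type"] == "int":
            assignment[param["name"]] = rng.randint(param["min"], param["max"])
        else:
            assignment[param["name"]] = rng.uniform(param["min"], param["max"])
    return assignment


def random_search(space, objective, max_trials, goal=None, seed=0):
    """Evaluate random assignments and return the best (assignment, value).

    Stops early once a trial reaches `goal`, if a goal is given.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(max_trials):
        assignment = sample(space, rng)
        value = objective(assignment)
        if best is None or value > best[1]:
            best = (assignment, value)
        if goal is not None and value >= goal:
            break
    return best


def toy_objective(assignment):
    """Stand-in for a training run: peaks at learning_rate=0.03, batch_size=150."""
    return (1.0
            - abs(assignment["learning_rate"] - 0.03) * 10
            - abs(assignment["batch_size"] - 150) / 1000)


space = [
    {"name": "learning_rate", "type": "double", "min": 0.01, "max": 0.05},
    {"name": "batch_size", "type": "int", "min": 100, "max": 200},
]
best_assignment, best_value = random_search(space, toy_objective, max_trials=12)
print(best_assignment, best_value)
```

In Katib, each sampled assignment would correspond to a Suggestion, and each objective evaluation to a Trial run by a worker job.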
Katib consists of several components as shown below. Each component runs on Kubernetes as a Deployment.
The components communicate with each other via gRPC. The API is defined at pkg/apis/manager/v1beta1/api.proto for the v1beta1 version and pkg/apis/manager/v1alpha3/api.proto for the v1alpha3 version.
- Katib main components:
  - katib-db-manager: gRPC API server of Katib, which is the DB interface.
  - katib-mysql: Data storage backend of Katib, using MySQL.
  - katib-ui: User interface of Katib.
  - katib-controller: Controller for Katib CRDs in Kubernetes.
Katib provides a Web UI. You can use it to visualize the general trend across the hyperparameter space and the history of each training run. You can use random-example or other examples to generate a similar UI.
See the Katib v1beta1 API reference docs.
See the Katib v1alpha3 API reference docs.
For a standard installation of Katib with support for all job operators, install Kubeflow. The official Katib version in the latest Kubeflow release is currently v1alpha3. See the documentation:
If you install Katib with other Kubeflow components, you can't submit Katib jobs in the kubeflow namespace.
Alternatively, if you want to install Katib manually with support for the TF and PyTorch operators, follow these steps:
Create the kubeflow namespace:
kubectl create namespace kubeflow
Clone the Kubeflow manifests repository:
git clone git@github.com:kubeflow/manifests.git
Set `MANIFESTS_DIR` to the cloned folder.
export MANIFESTS_DIR=<cloned-folder>
To install the TF operator, run the following:
cd "${MANIFESTS_DIR}/tf-training/tf-job-crds/base"
kustomize build . | kubectl apply -f -
cd "${MANIFESTS_DIR}/tf-training/tf-job-operator/base"
kustomize build . | kubectl apply -n kubeflow -f -
To install the PyTorch operator, run the following:
cd "${MANIFESTS_DIR}/pytorch-job/pytorch-job-crds/base"
kustomize build . | kubectl apply -f -
cd "${MANIFESTS_DIR}/pytorch-job/pytorch-operator/base/"
kustomize build . | kubectl apply -n kubeflow -f -
Finally, you can install Katib.
For the v1beta1 version, run the following:
git clone git@github.com:kubeflow/katib.git
bash katib/scripts/v1beta1/deploy.sh
For the v1alpha3 version, run the following:
cd "${MANIFESTS_DIR}/katib/katib-crds/base"
kustomize build . | kubectl apply -f -
cd "${MANIFESTS_DIR}/katib/katib-controller/base"
kustomize build . | kubectl apply -f -
If you install Katib from the Kubeflow manifests repository and want to use it in a cluster that doesn't have a StorageClass for dynamic volume provisioning, you have to create a persistent volume manually so that your persistent volume claim can bind to it.
This is a sample YAML file for creating a persistent volume with local storage:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  storageClassName: katib
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/katib
Create this PV after deploying the Katib package.
Check if all components are running successfully:
kubectl get pods -n kubeflow
Expected output:
NAME                                READY   STATUS    RESTARTS   AGE
katib-controller-858d6cc48c-df9jc   1/1     Running   1          20m
katib-db-manager-7966fbdf9b-w2tn8   1/1     Running   0          20m
katib-mysql-7f8bc6956f-898f9        1/1     Running   0          20m
katib-ui-7cf9f967bf-nm72p           1/1     Running   0          20m
pytorch-operator-55f966b548-9gq9v   1/1     Running   0          20m
tf-job-operator-796b4747d8-4fh82    1/1     Running   0          21m
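If you want to check this output in a script rather than by eye, a small parser is enough. The sketch below assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS first) and uses a hypothetical helper, `pods_not_ready`; the second sample row is invented to show a failing case.

```python
def pods_not_ready(pods_output):
    """Return names of pods that are not Running or not fully ready.

    Expects the text printed by `kubectl get pods`, header line included.
    """
    problems = []
    for line in pods_output.strip().splitlines()[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_now, ready_wanted = ready.split("/")
        if status != "Running" or ready_now != ready_wanted:
            problems.append(name)
    return problems


sample_output = """
NAME                                READY   STATUS    RESTARTS   AGE
katib-controller-858d6cc48c-df9jc   1/1     Running   1          20m
katib-mysql-7f8bc6956f-898f9        0/1     Pending   0          20m
katib-ui-7cf9f967bf-nm72p           1/1     Running   0          20m
"""
not_ready = pods_not_ready(sample_output)
print(not_ready)
```

An empty list means every listed pod is running with all of its containers ready.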
After deploying everything, you can run the examples to verify the installation. The examples below are for the v1beta1 version.
This is an example for the TF operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml
This is an example for the PyTorch operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/pytorchjob-example.yaml
You can check the status of the experiment:
$ kubectl describe experiment tfjob-example -n kubeflow
Name: tfjob-example
Namespace: kubeflow
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v1beta1
Kind: Experiment
Metadata:
  Creation Timestamp: 2020-07-15T14:27:53Z
  Finalizers:
    update-prometheus-metrics
  Generation: 1
  Resource Version: 100380029
  Self Link: /apis/kubeflow.org/v1beta1/namespaces/kubeflow/experiments/tfjob-example
  UID: 5e3cf1f5-c6a7-11ea-90dd-42010a9a0020
Spec:
  Algorithm:
    Algorithm Name: random
  Max Failed Trial Count: 3
  Max Trial Count: 12
  Metrics Collector Spec:
    Collector:
      Kind: TensorFlowEvent
    Source:
      File System Path:
        Kind: Directory
        Path: /train
  Objective:
    Goal: 0.99
    Metric Strategies:
      Name: accuracy_1
      Value: max
    Objective Metric Name: accuracy_1
    Type: maximize
  Parallel Trial Count: 3
  Parameters:
    Feasible Space:
      Max: 0.05
      Min: 0.01
    Name: learning_rate
    Parameter Type: double
    Feasible Space:
      Max: 200
      Min: 100
    Name: batch_size
    Parameter Type: int
  Resume Policy: LongRunning
  Trial Template:
    Trial Parameters:
      Description: Learning rate for the training model
      Name: learningRate
      Reference: learning_rate
      Description: Batch Size
      Name: batchSize
      Reference: batch_size
    Trial Spec:
      API Version: kubeflow.org/v1
      Kind: TFJob
      Spec:
        Tf Replica Specs:
          Worker:
            Replicas: 2
            Restart Policy: OnFailure
            Template:
              Spec:
                Containers:
                  Command:
                    python
                    /var/tf_mnist/mnist_with_summaries.py
                    --log_dir=/train/metrics
                    --learning_rate=${trialParameters.learningRate}
                    --batch_size=${trialParameters.batchSize}
                  Image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  Image Pull Policy: Always
                  Name: tensorflow
Status:
  Completion Time: 2020-07-15T14:30:52Z
  Conditions:
    Last Transition Time: 2020-07-15T14:27:53Z
    Last Update Time: 2020-07-15T14:27:53Z
    Message: Experiment is created
    Reason: ExperimentCreated
    Status: True
    Type: Created
    Last Transition Time: 2020-07-15T14:30:52Z
    Last Update Time: 2020-07-15T14:30:52Z
    Message: Experiment is running
    Reason: ExperimentRunning
    Status: False
    Type: Running
    Last Transition Time: 2020-07-15T14:30:52Z
    Last Update Time: 2020-07-15T14:30:52Z
    Message: Experiment has succeeded because Objective goal has reached
    Reason: ExperimentGoalReached
    Status: True
    Type: Succeeded
  Current Optimal Trial:
    Best Trial Name: tfjob-example-gjxn54vl
    Observation:
      Metrics:
        Latest: 0.966300010681
        Max: 1.0
        Min: 0.103260867298
        Name: accuracy_1
    Parameter Assignments:
      Name: learning_rate
      Value: 0.015945204040626416
      Name: batch_size
      Value: 184
  Start Time: 2020-07-15T14:27:53Z
  Succeeded Trial List:
    tfjob-example-5jd8nnjg
    tfjob-example-bgjfpd5t
    tfjob-example-gjxn54vl
    tfjob-example-vpdqxkch
    tfjob-example-wvptx7gt
  Trials: 5
  Trials Succeeded: 5
Events: <none>
When a condition of type Succeeded with status True appears under Status.Conditions, the experiment is finished.
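The same check can be scripted. The sketch below works on a list of condition dictionaries such as the one you would get from the experiment's JSON representation; the lowercase keys (`type`, `status`) and the `Failed` condition type are assumptions that do not appear in the output above, and both helper names are hypothetical.

```python
def experiment_phase(conditions):
    """Return the type of the most recent condition whose status is "True"."""
    for condition in reversed(conditions):
        if condition.get("status") == "True":
            return condition["type"]
    return None


def is_finished(conditions):
    """Treat Succeeded (and, by assumption, Failed) as terminal states."""
    return experiment_phase(conditions) in ("Succeeded", "Failed")


# Conditions transcribed from the describe output above.
conditions = [
    {"type": "Created", "status": "True", "reason": "ExperimentCreated"},
    {"type": "Running", "status": "False", "reason": "ExperimentRunning"},
    {"type": "Succeeded", "status": "True", "reason": "ExperimentGoalReached"},
]
phase = experiment_phase(conditions)
finished = is_finished(conditions)
print(phase, finished)
```

Scanning from the end reflects that conditions are appended as the experiment progresses, so the last one marked True describes the current state.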
You can monitor your results in the Katib UI.
Access the Katib UI via the Kubeflow dashboard if you used the standard installation, or port-forward the katib-ui service if you installed Katib manually.
kubectl -n kubeflow port-forward svc/katib-ui 8080:80
You can access the Katib UI using this URL: http://localhost:8080/katib/.
Katib provides a Python SDK for the v1beta1 and v1alpha3 versions.
- See the Katib v1beta1 SDK documentation.
- See the Katib v1alpha3 SDK documentation.
Run gen_sdk.sh to update the SDK.
To delete the installed TF and PyTorch operators, run kubectl delete -f on the respective folders.
To delete Katib for the v1beta1 version, run bash katib/scripts/v1beta1/undeploy.sh.
Please see Quick Start Guide.
Please see adopters.md.
Please feel free to test the system! developer-guide.md is a good starting point for developers.
If you use Katib in a scientific publication, we would appreciate citations to the following paper:
A Scalable and Cloud-Native Hyperparameter Tuning System, George et al., arXiv:2006.02085, 2020.
BibTeX entry:
@misc{george2020katib,
title={A Scalable and Cloud-Native Hyperparameter Tuning System},
author={Johnu George and Ce Gao and Richard Liu and Hou Gang Liu and Yuan Tang and Ramdoot Pydipaty and Amit Kumar Saha},
year={2020},
eprint={2006.02085},
archivePrefix={arXiv},
primaryClass={cs.DC}
}