Kube-queue is designed to manage AI/ML workloads in a Kubernetes-native manner. System administrators can customize the policy of each queue through plugins, which provides both flexibility and fairness across queues. Combined with a quota system (such as ResourceQuota), Kube-queue automates resource allocation to maximize utilization of cluster resources.
- Queue jobs by priority and creation time
- Dynamically adjust the priority of queued jobs
- Dequeue jobs based on ResourceQuota
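The ordering in the first feature can be pictured with a small shell sketch. It is only an illustration, not Kube-queue's implementation: the job names, priorities, and timestamps are hypothetical, and it assumes that a higher priority value is served first and that ties are broken by the older creation time.

```shell
# Hypothetical queue entries: <job-name> <priority> <creation-timestamp>.
# All values are made up for illustration.
queue_entries() {
  cat <<'EOF'
job-a 1 2021-09-13T02:15:36Z
job-b 5 2021-09-13T02:16:10Z
job-c 5 2021-09-13T02:15:50Z
job-d 3 2021-09-13T02:14:00Z
EOF
}

# Order entries: higher priority first (an assumption), then older creation
# time first. ISO 8601 UTC timestamps sort correctly as plain strings.
order_queue() {
  LC_ALL=C sort -t ' ' -k2,2nr -k3,3
}

# Print the job names in the order they would be considered.
queue_entries | order_queue | awk '{ print $1 }'
```

Under these assumptions `job-c` precedes `job-b` because both share priority 5 and `job-c` was created earlier.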
- Clone this repo to your machine
$ git clone https://github.com/kube-queue/kube-queue.git
- Change to the kube-queue directory
$ cd kube-queue
- Deploy Kube-queue with Helm
$ helm install kube-queue -n kube-system ./charts/v0.0.1
NAME: kube-queue
LAST DEPLOYED: Mon Sep 13 10:15:34 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
- Check running status of Kube-queue
$ helm get manifest kube-queue -n kube-system | kubectl get -n kube-queue -f -
NAME STATUS AGE
namespace/kube-queue Active 2m17s
NAME SECRETS AGE
serviceaccount/kube-queue 1 2m16s
NAME CREATED AT
customresourcedefinition.apiextensions.k8s.io/queueunits.scheduling.x-k8s.io 2021-09-13T02:15:36Z
NAME CREATED AT
clusterrole.rbac.authorization.k8s.io/kube-queue 2021-09-13T02:15:36Z
NAME ROLE AGE
clusterrolebinding.rbac.authorization.k8s.io/kube-queue-clusterrolebinding ClusterRole/kube-queue 2m16s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kube-queue-controller 1/1 1 1 2m16s
deployment.apps/tf-operator-extension 1/1 1 1 2m17s
deployment.apps/pytorch-operator-extesion 1/1 1 1 2m17s
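If you prefer to block until the components are ready rather than reading the table by eye, a helper along these lines may be useful. It is a sketch, not part of the Kube-queue repository: the function name `wait_for_deployments` and the 120-second timeout are our own choices, while the namespace and deployment names are taken from the output above.

```shell
# Wait until each named Deployment in the given namespace is fully rolled out.
# Usage: wait_for_deployments <namespace> <deployment>...
# Returns non-zero as soon as one rollout fails or times out.
wait_for_deployments() {
  local namespace="$1"
  shift
  local name
  for name in "$@"; do
    # The 120s timeout is an arbitrary choice for this sketch.
    kubectl rollout status "deployment/${name}" -n "$namespace" --timeout=120s || return 1
  done
}
```

For example, `wait_for_deployments kube-queue kube-queue-controller tf-operator-extension` returns once both deployments report a completed rollout.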
- Uninstall Kube-queue with Helm
$ helm uninstall kube-queue -n kube-system
We will submit two TFJobs to the cluster at the same time, but the cluster only has enough resources to satisfy the requests of one of them. Kube-queue then lets one job run and keeps the other queued; no pods are created for the queued job until it is dequeued.
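The admission decision described here can be sketched as a simple capacity check. This is a simplified model rather than Kube-queue's actual logic: it tracks CPU only, and the 4-CPU limit and the 3-CPU request per job are assumed values chosen so that exactly one job fits.

```shell
# A job may leave the queue only if its request fits in what remains of the quota.
# Arguments: <hard-limit> <currently-used> <request>, all in whole CPU cores.
fits_quota() {
  [ $(( $2 + $3 )) -le "$1" ]
}

hard_cpu=4   # assumed quota limit
used_cpu=0
job_cpu=3    # assumed total CPU request of each job

for job in job1 job2; do
  if fits_quota "$hard_cpu" "$used_cpu" "$job_cpu"; then
    used_cpu=$(( used_cpu + job_cpu ))
    echo "$job: dequeued (used ${used_cpu}/${hard_cpu} CPU)"
  else
    echo "$job: stays queued, no pods created"
  fi
done
```

With these numbers the first job is admitted and the second stays queued until the first one releases its share of the quota.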
1. Deploy a tf-operator that supports queueing (make sure that no other tf-operator is deployed in the cluster)
$ kubectl apply -f examples/tf-operator/
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
serviceaccount/tf-job-operator created
clusterrole.rbac.authorization.k8s.io/tf-job-dashboard created
clusterrole.rbac.authorization.k8s.io/tf-job-operator created
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator created
deployment.apps/tf-job-operator created
2. Check the running status of tf-operator
$ kubectl get -f examples/tf-operator/
NAME CREATED AT
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org 2021-09-13T06:46:39Z
NAME SECRETS AGE
serviceaccount/tf-job-operator 1 9s
NAME CREATED AT
clusterrole.rbac.authorization.k8s.io/tf-job-dashboard 2021-09-13T06:46:39Z
clusterrole.rbac.authorization.k8s.io/tf-job-operator 2021-09-13T06:46:39Z
NAME ROLE AGE
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator ClusterRole/tf-job-operator 9s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/tf-job-operator 1/1 1 1 8s
3. Create a ResourceQuota
$ kubectl create -f examples/tfjob/resource_quota.yaml
resourcequota/default created
$ kubectl get resourcequota default -o wide
NAME AGE REQUEST LIMIT
default 76s cpu: 0/4, memory: 0/4Gi
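The contents of `examples/tfjob/resource_quota.yaml` are not reproduced here. As a rough guide, a manifest like the one below would yield the same `kubectl get resourcequota` output (a quota named `default` with 4 CPUs and 4Gi of memory for requests). Treat it as a reconstruction from that output, not as the file shipped in the repository.

```shell
# Write a ResourceQuota manifest consistent with the output shown above.
# This is an assumed reconstruction, not a copy of the repository file.
quota_file="$(mktemp)"
cat > "$quota_file" <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default
spec:
  hard:
    cpu: "4"
    memory: 4Gi
EOF
echo "Manifest written to $quota_file"
```

The file could then be applied with `kubectl create -f "$quota_file"` in place of the repository example.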
4. Submit two TFJobs at the same time
$ kubectl create -f examples/tfjob/job1.yaml; kubectl create -f examples/tfjob/job2.yaml
tfjob.kubeflow.org/job1 created
tfjob.kubeflow.org/job2 created
5. Check the status of the jobs
5.1 At first, only one job has its pods created and starts running. The other job has an empty state because it is still queued.
$ kubectl get tfjob
NAME STATE AGE
job1 Running 5s
job2 5s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
job1-ps-0 1/1 Running 0 8s
job1-worker-0 1/1 Running 0 8s
job1-worker-1 1/1 Running 0 8s
5.2 Once job1 reaches the Succeeded state, job2 is dequeued and starts running.
$ kubectl get tfjob
NAME STATE AGE
job1 Succeeded 38s
job2 Running 38s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
job1-worker-0 0/1 Completed 0 54s
job1-worker-1 0/1 Completed 0 54s
job2-ps-0 1/1 Running 0 22s
job2-worker-0 1/1 Running 0 22s
job2-worker-1 1/1 Running 0 21s
5.3 Finally, both jobs reach the Succeeded state.
$ kubectl get tfjob
NAME STATE AGE
job1 Succeeded 71s
job2 Succeeded 71s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
job1-worker-0 0/1 Completed 0 5m
job1-worker-1 0/1 Completed 0 5m
job2-ps-0 0/1 Completed 0 4m28s
job2-worker-0 0/1 Completed 0 4m28s
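To follow these transitions from a script, the table printed by `kubectl get tfjob` can be parsed with a few lines of shell. The helper below is a minimal sketch: the function name `job_state` is hypothetical, and the label "Queued" for a job whose STATE column is empty is our own wording, not a state reported by the operator.

```shell
# Read `kubectl get tfjob` table output on stdin and print the STATE of one job.
# A row with only NAME and AGE (empty STATE) is reported as "Queued".
job_state() {
  awk -v job="$1" '
    NR > 1 && $1 == job {
      if (NF >= 3) print $2; else print "Queued"
    }'
}

# Sample copied from the step 5.1 output above.
sample='NAME STATE AGE
job1 Running 5s
job2 5s'

printf '%s\n' "$sample" | job_state job1   # prints "Running"
printf '%s\n' "$sample" | job_state job2   # prints "Queued"
```

Against a live cluster the same helper can be fed directly, for example `kubectl get tfjob | job_state job2`.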