Primary language: Go. License: Apache License 2.0 (Apache-2.0).

Kube-queue

Kube-queue is designed to manage AI/ML workloads in a Kubernetes-native manner. System admins can customize the policy of each queue through plugins, which gives both flexibility and fairness between queues. Combined with a quota system (such as ResourceQuota), it automates resource allocation and aims to maximize utilization of cluster resources.

Architecture

[Figure: Kube-queue architecture diagram]

Key features

  • Queues jobs by priority and creation time
  • Supports dynamic adjustment of a job's priority while it is in the queue
  • Dequeues jobs based on ResourceQuota
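
The first two features can be illustrated with a small Go sketch. This is not Kube-queue's actual implementation: the `QueueItem` and `JobHeap` types, the `UpdatePriority` helper, and the "higher number is served first" convention are all assumptions made for illustration. It orders items by priority, breaks ties by creation time, and uses `heap.Fix` from the standard library to re-sort an item after its priority changes.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// QueueItem is a hypothetical stand-in for a queued job.
// It is not the type used by Kube-queue itself.
type QueueItem struct {
	Name      string
	Priority  int       // assumed convention: higher value is dequeued first
	CreatedAt time.Time // earlier creation wins among equal priorities
	index     int       // position in the heap, maintained by Swap and Push
}

// JobHeap implements heap.Interface over queued items.
type JobHeap []*QueueItem

func (h JobHeap) Len() int { return len(h) }

func (h JobHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].CreatedAt.Before(h[j].CreatedAt)
}

func (h JobHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index = i
	h[j].index = j
}

func (h *JobHeap) Push(x interface{}) {
	item := x.(*QueueItem)
	item.index = len(*h)
	*h = append(*h, item)
}

func (h *JobHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	old[n-1] = nil
	*h = old[:n-1]
	return item
}

// UpdatePriority changes the priority of an item that is still queued
// and restores the heap order around it.
func (h *JobHeap) UpdatePriority(item *QueueItem, priority int) {
	item.Priority = priority
	heap.Fix(h, item.index)
}

// demoOrder queues three jobs, raises the priority of the newest one,
// and returns the names in the order they are dequeued.
func demoOrder() []string {
	base := time.Date(2021, 9, 13, 10, 0, 0, 0, time.UTC)
	a := &QueueItem{Name: "job-a", Priority: 1, CreatedAt: base}
	b := &QueueItem{Name: "job-b", Priority: 1, CreatedAt: base.Add(time.Second)}
	c := &QueueItem{Name: "job-c", Priority: 0, CreatedAt: base.Add(2 * time.Second)}

	h := &JobHeap{}
	for _, item := range []*QueueItem{c, b, a} {
		heap.Push(h, item)
	}
	h.UpdatePriority(c, 5) // dynamic adjustment while job-c is still queued

	var names []string
	for h.Len() > 0 {
		names = append(names, heap.Pop(h).(*QueueItem).Name)
	}
	return names
}

func main() {
	fmt.Println(demoOrder()) // [job-c job-a job-b]
}
```

In this sketch job-c jumps ahead after its priority is raised, while job-a precedes job-b only because it was created earlier.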

Install

  1. Clone this repo to your machine:
$ git clone https://github.com/kube-queue/kube-queue.git
  2. Change to the Kube-queue directory:
$ cd kube-queue
  3. Deploy Kube-queue with Helm:
$ helm install kube-queue -n kube-system ./charts/v0.0.1
NAME: kube-queue
LAST DEPLOYED: Mon Sep 13 10:15:34 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
  4. Check the running status of Kube-queue:
$ helm get manifest kube-queue -n kube-system | kubectl get -n kube-queue -f -
NAME                   STATUS   AGE
namespace/kube-queue   Active   2m17s

NAME                        SECRETS   AGE
serviceaccount/kube-queue   1         2m16s

NAME                                                                           CREATED AT
customresourcedefinition.apiextensions.k8s.io/queueunits.scheduling.x-k8s.io   2021-09-13T02:15:36Z

NAME                                               CREATED AT
clusterrole.rbac.authorization.k8s.io/kube-queue   2021-09-13T02:15:36Z

NAME                                                                         ROLE                     AGE
clusterrolebinding.rbac.authorization.k8s.io/kube-queue-clusterrolebinding   ClusterRole/kube-queue   2m16s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kube-queue-controller       1/1     1            1           2m16s
deployment.apps/tf-operator-extension       1/1     1            1           2m17s
deployment.apps/pytorch-operator-extesion   1/1     1            1           2m17s
  5. Uninstall Kube-queue with Helm:
$ helm uninstall kube-queue -n kube-system

Example

This example submits two tf jobs to the cluster at the same time, while the cluster can satisfy the resource requests of only one of them. Kube-queue lets one job run and keeps the other in the queue; the pods of the queued job are not created while it waits.
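
The queueing behavior described above can be modeled with a short Go sketch. It is an illustration under stated assumptions, not Kube-queue's code: the `Resources`, `Job`, and `Queue` types are hypothetical, the quota of 4 CPU and 4 GiB mirrors the ResourceQuota used in this example, and the per-job request of 3 CPU and 3 GiB is invented because the job manifests are not shown here. The sketch also assumes strict head-of-line ordering, where a job that does not fit blocks the jobs behind it.

```go
package main

import "fmt"

// Resources is a simplified resource vector (hypothetical type).
type Resources struct {
	MilliCPU  int64 // CPU in millicores
	MemoryMiB int64 // memory in MiB
}

// Add returns the element-wise sum of two resource vectors.
func (r Resources) Add(o Resources) Resources {
	return Resources{r.MilliCPU + o.MilliCPU, r.MemoryMiB + o.MemoryMiB}
}

// Sub returns the element-wise difference of two resource vectors.
func (r Resources) Sub(o Resources) Resources {
	return Resources{r.MilliCPU - o.MilliCPU, r.MemoryMiB - o.MemoryMiB}
}

// Fits reports whether r stays within limit in every dimension.
func (r Resources) Fits(limit Resources) bool {
	return r.MilliCPU <= limit.MilliCPU && r.MemoryMiB <= limit.MemoryMiB
}

// Job is a hypothetical queued workload with its total resource request.
type Job struct {
	Name    string
	Request Resources
}

// Queue tracks quota usage and the jobs still waiting for admission.
type Queue struct {
	Hard    Resources // quota limit
	Used    Resources // resources held by admitted jobs
	Pending []Job     // waiting jobs, head first
}

// Admit dequeues jobs from the head while they fit in the remaining quota.
// Assumption: the first job that does not fit blocks the ones behind it.
func (q *Queue) Admit() []string {
	var admitted []string
	for len(q.Pending) > 0 {
		next := q.Pending[0]
		total := q.Used.Add(next.Request)
		if !total.Fits(q.Hard) {
			break
		}
		q.Used = total
		q.Pending = q.Pending[1:]
		admitted = append(admitted, next.Name)
	}
	return admitted
}

// Release returns the resources of a finished job to the quota.
func (q *Queue) Release(j Job) {
	q.Used = q.Used.Sub(j.Request)
}

// simulate runs the two-job scenario and returns the jobs admitted
// before and after the first job finishes.
func simulate() (first, second []string) {
	req := Resources{MilliCPU: 3000, MemoryMiB: 3072} // assumed request per job
	job1 := Job{Name: "job1", Request: req}
	job2 := Job{Name: "job2", Request: req}
	q := &Queue{
		Hard:    Resources{MilliCPU: 4000, MemoryMiB: 4096},
		Pending: []Job{job1, job2},
	}
	first = q.Admit() // only job1 fits; job2 stays queued
	q.Release(job1)   // job1 succeeded and frees its quota
	second = q.Admit()
	return first, second
}

func main() {
	first, second := simulate()
	fmt.Println(first, second) // [job1] [job2]
}
```

While job2 waits in `Pending`, nothing is created for it, which corresponds to the queued job having no pods in the walk-through below.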

1. Deploy a tf-operator that supports queueing (ensure that no other tf-operator is deployed in the cluster)

$ kubectl apply -f examples/tf-operator/
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
serviceaccount/tf-job-operator created
clusterrole.rbac.authorization.k8s.io/tf-job-dashboard created
clusterrole.rbac.authorization.k8s.io/tf-job-operator created
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator created
deployment.apps/tf-job-operator created

2. Check running status of tf-operator

$ kubectl get -f examples/tf-operator/
NAME                                                                CREATED AT
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org   2021-09-13T06:46:39Z

NAME                             SECRETS   AGE
serviceaccount/tf-job-operator   1         9s

NAME                                                     CREATED AT
clusterrole.rbac.authorization.k8s.io/tf-job-dashboard   2021-09-13T06:46:39Z
clusterrole.rbac.authorization.k8s.io/tf-job-operator    2021-09-13T06:46:39Z

NAME                                                           ROLE                          AGE
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator   ClusterRole/tf-job-operator   9s

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/tf-job-operator   1/1     1            1           8s

3. Create a ResourceQuota for the default namespace

$ kubectl create -f examples/tfjob/resource_quota.yaml
resourcequota/default created

$ kubectl get resourcequota default -o wide
NAME      AGE   REQUEST                   LIMIT
default   76s   cpu: 0/4, memory: 0/4Gi

4. Submit tf jobs

$ kubectl create -f examples/tfjob/job1.yaml;kubectl create -f examples/tfjob/job2.yaml
tfjob.kubeflow.org/job1 created
tfjob.kubeflow.org/job2 created

5. Check the status of tf jobs

5.1 At first, only one job has its pods created and is running.

$ kubectl get tfjob
NAME   STATE     AGE
job1   Running   5s
job2             5s

$ kubectl get pods
NAME            READY   STATUS    RESTARTS   AGE
job1-ps-0       1/1     Running   0          8s
job1-worker-0   1/1     Running   0          8s
job1-worker-1   1/1     Running   0          8s

5.2 Once the state of job1 is Succeeded, job2 starts running.

$ kubectl get tfjob
NAME   STATE       AGE
job1   Succeeded   38s
job2   Running     38s

$ kubectl get pods
NAME            READY   STATUS      RESTARTS   AGE
job1-worker-0   0/1     Completed   0          54s
job1-worker-1   0/1     Completed   0          54s
job2-ps-0       1/1     Running     0          22s
job2-worker-0   1/1     Running     0          22s
job2-worker-1   1/1     Running     0          21s

5.3 Finally, both jobs are in the Succeeded state.

$ kubectl get tfjob
NAME   STATE       AGE
job1   Succeeded   71s
job2   Succeeded   71s

$ kubectl get pods
NAME            READY   STATUS      RESTARTS   AGE
job1-worker-0   0/1     Completed   0          5m
job1-worker-1   0/1     Completed   0          5m
job2-ps-0       0/1     Completed   0          4m28s
job2-worker-0   0/1     Completed   0          4m28s