kubedl-io/kubedl

[BUG] Training pods' scheduler name overridden by default-scheduler

Closed this issue · 5 comments

Tested training workloads with the flag --gang-scheduler-name=kube-coscheduler set.

# test yaml
apiVersion: training.kubedl.io/v1alpha1
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: kubedl
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        spec:
          schedulerName: scheduler-plugins-scheduler
          containers:
            - name: tensorflow
              image: kubedl/tf-mnist-with-summaries:1.0
              imagePullPolicy: IfNotPresent
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
              resources:
                limits:
                  cpu: 2048m
                  memory: 2Gi
                requests:
                  cpu: 1024m
                  memory: 1Gi
          volumes:
            - name: "training"
              hostPath:
                path: /tmp/data
                type: DirectoryOrCreate

The scheduler name of the created pods is always default-scheduler, even though the spec above sets schedulerName: scheduler-plugins-scheduler. This is not very flexible.

@shinytang6 I see. For now, schedulerName is overridden inside the kube-coscheduler implementation even when the user has specified it. Overriding schedulerName only when it is empty seems more reasonable; I will fix it.

Furthermore, we should decouple the gang-scheduler-name and scheduler-name-to-set concepts so that each can be set separately.
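
To illustrate the "override only when empty" behavior described above, here is a minimal Python sketch. It is not KubeDL's code (the real logic is written in Go, in pkg/job_controller/pod.go); PodSpec, apply_scheduler_name, and the scheduler_name_to_set parameter are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass
class PodSpec:
    """Hypothetical stand-in for the scheduler-related part of a pod spec."""
    scheduler_name: str = ""


def apply_scheduler_name(spec: PodSpec, scheduler_name_to_set: str) -> PodSpec:
    """Fill in schedulerName only when the user left it empty.

    scheduler_name_to_set is an assumed, separately configurable value;
    it is not an existing KubeDL flag.
    """
    if not spec.scheduler_name:
        spec.scheduler_name = scheduler_name_to_set
    return spec


# A user-specified name is preserved.
user_set = apply_scheduler_name(
    PodSpec(scheduler_name="scheduler-plugins-scheduler"), "default-scheduler"
)
print(user_set.scheduler_name)  # scheduler-plugins-scheduler

# An empty name falls back to the configured one.
unset = apply_scheduler_name(PodSpec(), "default-scheduler")
print(unset.scheduler_name)  # default-scheduler
```

With this rule, the TFJob in the report would keep scheduler-plugins-scheduler, and only pods that omit schedulerName would receive the fallback value.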

@shinytang6 Hi, I see that in the latest code schedulerName is only overridden when it is empty. Did you deploy an out-of-date KubeDL version?
https://github.com/kubedl-io/kubedl/blob/master/pkg/job_controller/pod.go#L472

Yes, I also noticed that. I am using docker.io/kubedl/kubedl:0.4.2; it seems the commit with the latest change is not included in that image:

7c2c969#diff-f1298812d743ef9536ccc3a415a283c84baa3ecf38e0c7993b6296fa1f2d3debR474

I see. I will release a new image soon that includes this bugfix and some recent enhancements.

Closing, since this issue has been fixed on the master branch.