[BUG] Training pods' scheduler name overridden by default-scheduler
Closed this issue · 5 comments
Testing training workloads with --gang-scheduler-name=kube-coscheduler:
# test yaml
apiVersion: training.kubedl.io/v1alpha1
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: kubedl
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        spec:
          schedulerName: scheduler-plugins-scheduler
          containers:
            - name: tensorflow
              image: kubedl/tf-mnist-with-summaries:1.0
              imagePullPolicy: IfNotPresent
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
              resources:
                limits:
                  cpu: 2048m
                  memory: 2Gi
                requests:
                  cpu: 1024m
                  memory: 1Gi
          volumes:
            - name: "training"
              hostPath:
                path: /tmp/data
                type: DirectoryOrCreate
The scheduler name of the created pods is always default-scheduler, regardless of the schedulerName set in the template, which is not very flexible.
@shinytang6 I see. For now, the schedulerName is overridden inside the kube-coscheduler implementation even when it is specified by the user. Overriding schedulerName only when it is empty seems more reasonable; I will fix it.
Furthermore, we should decouple the gang-scheduler-name and scheduler-name-to-set concepts so that both can be set separately.
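To make the proposed behavior concrete, here is a minimal sketch of "override only when empty". It is written in Python for brevity and is not kubedl's actual code; the helper name `apply_scheduler_default` and the dict-based pod spec are assumptions made for illustration only.

```python
from typing import Any, Dict


def apply_scheduler_default(pod_spec: Dict[str, Any], fallback: str) -> Dict[str, Any]:
    """Return a copy of pod_spec whose schedulerName is filled in only if unset.

    Hypothetical helper for illustration; not part of kubedl.
    """
    result = dict(pod_spec)  # shallow copy so the caller's spec is untouched
    if not result.get("schedulerName"):
        # Empty or missing: fall back to the configured scheduler name.
        result["schedulerName"] = fallback
    # Otherwise the user's explicit choice is preserved.
    return result


if __name__ == "__main__":
    user_set = {"schedulerName": "scheduler-plugins-scheduler"}
    unset = {"containers": []}
    print(apply_scheduler_default(user_set, "default-scheduler")["schedulerName"])
    print(apply_scheduler_default(unset, "default-scheduler")["schedulerName"])
```

With this rule, the TFJob above would keep scheduler-plugins-scheduler, and only templates that leave schedulerName empty would receive the fallback.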
@shinytang6 Hi, I find that schedulerName is only overridden when it is empty in the latest code. Did you deploy an out-of-date kubedl version?
https://github.com/kubedl-io/kubedl/blob/master/pkg/job_controller/pod.go#L472
Yes, I also noticed that. I am using docker.io/kubedl/kubedl:0.4.2; it seems the latest commit with that change is not included:
7c2c969#diff-f1298812d743ef9536ccc3a415a283c84baa3ecf38e0c7993b6296fa1f2d3debR474
I see. I will release a new image soon, including the bugfix and some recent enhancements.
Closing, since this issue has been fixed in the master branch.