MXJob Kubernetes resource was created successfully, but SCHEDULER, SERVER, and WORKER jobs were not created.
xigang opened this issue · 2 comments
xigang commented
I followed the README to create an MXJob. The MXJob Kubernetes resource was created successfully, but the SCHEDULER, SERVER, and WORKER objects were not created.
kubernetes version: 1.12.2
kubeflow version: 0.3.1
ksonnet version: dev-2018-11-20T22:21:08+0800
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4
The following information may be useful:
# kubectl get crd
NAME                     CREATED AT
mxjobs.kubeflow.org      2018-11-21T14:07:37Z
studyjobs.kubeflow.org   2018-11-20T14:35:52Z
tfjobs.kubeflow.org      2018-11-20T14:35:36Z
workflows.argoproj.io    2018-11-20T14:35:44Z
# kubectl get pods -n kubeflow
NAME                                        READY   STATUS    RESTARTS   AGE
ambassador-c97f7b448-cdqdx                  3/3     Running   1          7h9m
ambassador-c97f7b448-f8t8v                  3/3     Running   1          7h9m
ambassador-c97f7b448-hlnrg                  3/3     Running   1          7h9m
argo-ui-7495b79b59-96xkq                    1/1     Running   0          7h9m
centraldashboard-798f8d68d5-swg7v           1/1     Running   0          7h9m
modeldb-backend-d69695b66-fkxgs             1/1     Running   0          7h9m
modeldb-db-975db58f7-4dbck                  1/1     Running   0          7h9m
modeldb-frontend-78ccff78b7-pr8kv           1/1     Running   0          7h9m
mxnet-operator-6c49b767bc-5mpjg             1/1     Running   0          12m
spartakus-volunteer-ffdfcdb5c-dlvz2         1/1     Running   0          7h9m
studyjob-controller-7df5754ddf-5fdgk        1/1     Running   0          7h9m
tf-hub-0                                    1/1     Running   0          7h8m
tf-job-dashboard-7499d5cbcf-52dfr           1/1     Running   0          7h9m
tf-job-operator-v1alpha2-644c5f7db7-vnglr   1/1     Running   0          7h9m
# cat mx_job_dist.yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "MXJob"
metadata:
  name: "gpu-dist-job"
spec:
  jobMode: "dist"
  replicaSpecs:
    - replicas: 1
      mxReplicaType: SCHEDULER
      PsRootPort: 9000
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: SERVER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: WORKER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","1","--num-layers","2","--kv-store","dist_device_sync","--gpus","0,1"]
              resources:
                limits:
                  nvidia.com/gpu: 2
          restartPolicy: OnFailure
With the above in place, creating the MXJob gpu-dist-job succeeded,
but no scheduler, server, or worker jobs were created :(
# kubectl get mxjobs
NAME AGE
gpu-dist-job 12m
# kubectl get jobs
No resources found.
How should I solve it? thx.
KingOnTheStar commented
I tested your script and everything works, except that we forgot to provide the CRD for v1alpha1 :-( ...
"kubectl get jobs" returning nothing is expected. To see the SERVER, SCHEDULER, and WORKER replicas, use "kubectl get pods" instead.
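For anyone checking the same thing, here is a minimal sketch of narrowing the "kubectl get pods" output down to one MXJob. The helper name filter_job_pods is hypothetical, and it assumes the operator puts the MXJob name somewhere in each pod name, which this thread does not confirm. If your pods are named differently, adjust the match.

```shell
# Hypothetical helper (not part of kubectl or mxnet-operator).
# Reads `kubectl get pods` output on stdin and keeps the header row plus
# every row whose NAME column contains the given job name.
# Assumption: the operator embeds the MXJob name in the pod names.
filter_job_pods() {
  job_name="$1"
  awk -v job="$job_name" 'NR == 1 || index($1, job) > 0'
}

# Usage against a live cluster (left commented out so the snippet is safe to source):
# kubectl get pods | filter_job_pods gpu-dist-job
```

If the filter shows only the header row, it is worth looking at the unfiltered "kubectl get pods" output before concluding that the replicas are missing, since the naming assumption may simply not hold.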