kubeflow/mxnet-operator

MXJob Kubernetes resource was created successfully, but SCHEDULER, SERVER, and WORKER jobs were not created.

xigang opened this issue · 2 comments

I followed the README to create an MXJob. The MXJob Kubernetes resource was created successfully, but the SCHEDULER, SERVER, and WORKER objects were not created.

kubernetes version: 1.12.2
kubeflow version: 0.3.1
ksonnet version: dev-2018-11-20T22:21:08+0800
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4

The following information may be useful:

# kubectl  get crd
NAME                     CREATED AT
mxjobs.kubeflow.org      2018-11-21T14:07:37Z
studyjobs.kubeflow.org   2018-11-20T14:35:52Z
tfjobs.kubeflow.org      2018-11-20T14:35:36Z
workflows.argoproj.io    2018-11-20T14:35:44Z
# kubectl  get pods -n kubeflow
NAME                                                      READY   STATUS             RESTARTS   AGE
ambassador-c97f7b448-cdqdx                                3/3     Running            1          7h9m
ambassador-c97f7b448-f8t8v                                3/3     Running            1          7h9m
ambassador-c97f7b448-hlnrg                                3/3     Running            1          7h9m
argo-ui-7495b79b59-96xkq                                  1/1     Running            0          7h9m
centraldashboard-798f8d68d5-swg7v                         1/1     Running            0          7h9m
modeldb-backend-d69695b66-fkxgs                           1/1     Running            0          7h9m
modeldb-db-975db58f7-4dbck                                1/1     Running            0          7h9m
modeldb-frontend-78ccff78b7-pr8kv                         1/1     Running            0          7h9m
mxnet-operator-6c49b767bc-5mpjg                           1/1     Running            0          12m
spartakus-volunteer-ffdfcdb5c-dlvz2                       1/1     Running            0          7h9m
studyjob-controller-7df5754ddf-5fdgk                      1/1     Running            0          7h9m
tf-hub-0                                                  1/1     Running            0          7h8m
tf-job-dashboard-7499d5cbcf-52dfr                         1/1     Running            0          7h9m
tf-job-operator-v1alpha2-644c5f7db7-vnglr                 1/1     Running            0          7h9m
# cat mx_job_dist.yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "MXJob"
metadata:
  name: "gpu-dist-job"
spec:
  jobMode: "dist"
  replicaSpecs:
    - replicas: 1
      mxReplicaType: SCHEDULER
      PsRootPort: 9000
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: SERVER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: WORKER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","1","--num-layers","2","--kv-store","dist_device_sync","--gpus","0,1"]
              resources:
                limits:
                  nvidia.com/gpu: 2
          restartPolicy: OnFailure

With the above in place, creating the MXJob gpu-dist-job succeeded, but no scheduler, server, or worker jobs were created :(

# kubectl  get mxjobs
NAME           AGE
gpu-dist-job   12m
# kubectl  get jobs
No resources found.

How can I solve this? Thanks.

@gaocegege @suleisl2000

I tested your script and everything works, except that we forgot to provide the CRD for v1alpha1 :-( ...
"kubectl get jobs" returns nothing, and that is expected. If you want to see the SERVER, SCHEDULER, and WORKER replicas, you should use "kubectl get pods" instead.
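For anyone hitting the same thing, here is a rough sketch of the checks described in this thread. It only uses standard kubectl subcommands plus names taken from the output above (the MXJob name gpu-dist-job and the operator pod mxnet-operator-6c49b767bc-5mpjg). It assumes the MXJob lives in the current default namespace, as the "kubectl get mxjobs" call above suggests; adjust the names for your own cluster.

```shell
#!/bin/sh
# Sketch only: names below come from the output earlier in this issue.
JOB_NAME="gpu-dist-job"                            # MXJob name from mx_job_dist.yaml
OPERATOR_POD="mxnet-operator-6c49b767bc-5mpjg"     # operator pod from "kubectl get pods -n kubeflow"

if command -v kubectl >/dev/null 2>&1; then
  # The MXJob custom resource itself.
  kubectl get mxjobs

  # The replicas show up as pods, not as Job objects.
  kubectl get pods

  # Events and status recorded on the MXJob (assumes default namespace).
  kubectl describe mxjobs "$JOB_NAME"

  # Operator logs, useful if no pods appear at all.
  kubectl logs -n kubeflow "$OPERATOR_POD"
else
  echo "kubectl not found on PATH"
fi
```

If "kubectl get pods" shows nothing either, the operator logs are probably the first place to look.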

@xigang It seems that you closed this issue. Is there anything else we can help you with?