kubeflow/mxnet-operator

Customizing environment variables of the pods in an MXJob crashes the operator

stsukrov opened this issue · 6 comments

apiVersion: kubeflow.org/v1alpha1
kind: MXJob
metadata:
  name: mxnet-gpu-dist-job
spec:
  jobMode: dist
  replicaSpecs:
  - mxReplicaType: SCHEDULER
    PsRootPort: 9000
    replicas: 1
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
          env:
          - name: PS_VERBOSE
            value: "2"
        restartPolicy: OnFailure
  - mxReplicaType: SERVER
    replicas: 2
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
#          env:
#          - name: PS_VERBOSE
#            value: "2"
  - mxReplicaType: WORKER
    replicas: 4
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          args:
            - /incubator-mxnet/example/image-classification/train_imagenet.py
            - --num-epochs
            - '1'
            - --benchmark
            - '1'
            - --kv-store
            - dist_device_sync
            - --network
            - inception-v3
            - --batch-size
            - '64'
            - --image-shape
            - '3,299,299'
            - --gpus
            - '0'
          command:
            - python
#          env:
#            - name: PS_VERBOSE
#              value: "2"
          name: mxnet
          resources:
            limits:
                nvidia.com/gpu: 1
        restartPolicy: OnFailure

Enabling PS_VERBOSE on any of the pods (i.e. uncommenting any of the env blocks above) crashes the operator:

f018986aae72:baictl stsukrov$ kubectl logs mxnet-operator-f46557c4f-wfklx
{"filename":"app/server.go:64","level":"info","msg":"KUBEFLOW_NAMESPACE not set, using default namespace","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:69","level":"info","msg":"[API Version: v1alpha1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.10.2 Go OS/Arch: linux/amd64]","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:153","level":"info","msg":"No controller_config_file provided; using empty config.","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:138","level":"info","msg":"Setting up event handlers","time":"2019-03-26T09:33:49Z"}
I0326 09:33:49.093823       1 leaderelection.go:174] attempting to acquire leader lease...
E0326 09:33:49.115391       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mxnet-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/mxnet-operator", UID:"c7a02cee-4efb-11e9-a1a1-025004746b4c", ResourceVersion:"204330", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689114690, loc:(*time.Location)(0x18bc1a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"mxnet-operator-f46557c4f-wfklx\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-25T12:44:50Z\",\"renewTime\":\"2019-03-26T09:33:49Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'mxnet-operator-f46557c4f-wfklx became leader'
I0326 09:33:49.115722       1 leaderelection.go:184] successfully acquired lease default/mxnet-operator
{"filename":"controller/controller.go:176","level":"info","msg":"Starting MXJob controller","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:179","level":"info","msg":"Waiting for informer caches to sync","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:184","level":"info","msg":"Starting 1 workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:190","level":"info","msg":"Started workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:273","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Creating new job default/mxnet-gpu-dist-job","time":"2019-03-26T09:33:49Z"}
{"filename":"trainer/replicas.go:507","job":"default/mxnet-gpu-dist-job","job_type":"SCHEDULER","level":"info","msg":"Job mxnet-gpu-dist-job missing pod for replica SCHEDULER index 0, creating a new one.","mx_job_name":"mxnet-gpu-dist-job","runtime_id":"pub8","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:245","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Finished syncing job \"default/mxnet-gpu-dist-job\" (7.526234ms)","time":"2019-03-26T09:33:49Z"}
E0326 09:33:49.223718       1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/tusimple/go/src/runtime/asm_amd64.s:573
/home/tusimple/go/src/runtime/panic.go:502
/home/tusimple/go/src/runtime/panic.go:28
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/tusimple/go/src/runtime/asm_amd64.s:2361
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 102 [running]:
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0xfb2f60, 0x18aabd0)
	/home/tusimple/go/src/runtime/panic.go:502 +0x229
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).CreatePodWithIndex(0xc42041f980, 0x0, 0x3f, 0xc4205ab378, 0x3)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218 +0x11cd
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).SyncPods(0xc42041f980, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509 +0x3a8
github.com/kubeflow/mxnet-operator/pkg/trainer.(*TrainingJob).Reconcile(0xc420692370, 0xc4205dc1d0, 0xc420711300, 0x1a, 0xc420358158)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362 +0x11f
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).syncMXJob(0xc4205dc1b0, 0xc420711300, 0x1a, 0xc420086c00, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291 +0xaee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.syncMXJob)-fm(0xc420711300, 0x1a, 0xc4205bd380, 0xf58b20, 0xc4203227c0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162 +0x3e
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).processNextWorkItem(0xc4205dc1b0, 0xc4205ca100)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215 +0xee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).runWorker(0xc4205dc1b0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201 +0x2b
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.runWorker)-fm()
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x2a
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42042e5b0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42042e5b0, 0x3b9aca00, 0x0, 0x1, 0xc4205e6600)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42042e5b0, 0x3b9aca00, 0xc4205e6600)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).Run
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x22b
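
For reference, the trace above points at the env handling in CreatePodWithIndex (pkg/trainer/replicas.go:218). I have not dug into the v1alpha1 source, so the sketch below is purely illustrative and not the operator's actual code (the DMLC_* names are just placeholders for whatever the operator injects), but a slice-indexing pattern like this panics in exactly this way only when the pod template already carries its own env entries:

package main

import "fmt"

// envVar stands in for k8s.io/api/core/v1.EnvVar to keep the sketch self-contained.
type envVar struct{ name, value string }

// buggyMerge copies into a slice sized only for the operator-injected vars,
// so any user-supplied vars push the index past the end of the slice.
func buggyMerge(operatorEnv, userEnv []envVar) []envVar {
    merged := make([]envVar, len(operatorEnv))
    for i, e := range append(userEnv, operatorEnv...) {
        merged[i] = e // index out of range as soon as len(userEnv) > 0
    }
    return merged
}

// safeMerge appends instead, so extra user-supplied vars can never overflow anything.
func safeMerge(operatorEnv, userEnv []envVar) []envVar {
    return append(append([]envVar{}, userEnv...), operatorEnv...)
}

func main() {
    operatorEnv := []envVar{
        {"DMLC_ROLE", "worker"},
        {"DMLC_PS_ROOT_PORT", "9000"},
    }
    userEnv := []envVar{{"PS_VERBOSE", "2"}} // the env block from the job spec

    fmt.Println(safeMerge(operatorEnv, userEnv))  // works with or without user env
    fmt.Println(buggyMerge(operatorEnv, userEnv)) // panics: runtime error: index out of range
}

With an empty userEnv both paths succeed, which would explain why the job runs fine as long as the env blocks stay commented out.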

@stsukrov Hi Stanislav, I tested your case with the latest mxnet-operator and failed to reproduce the issue. I suspect v1beta1 has already fixed the bug. Could you test your case with the v1beta1 operator? BTW, the v1alpha1 version is deprecated, and I did not keep backward compatibility due to a shortage of resources to maintain it.

Below is the spec of the job I tested. You can see that I added an "env" block to all of the pods, and the job completed successfully.

leisu@desk-73:/extend/Workspace/myproj/src/github.com/kubeflow/mxnet-operator/manifests$ kubectl get mxjob mxnet-job -o yaml
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  creationTimestamp: 2019-04-02T08:51:55Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "133831"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
  uid: 9125369d-5524-11e9-b812-704d7bb59f71
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - env:
            - name: PS_VERBOSE
              value: "2"
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - env:
            - name: PS_VERBOSE
              value: "2"
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "30"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            env:
            - name: PS_VERBOSE
              value: "2"
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2019-04-02T08:53:12Z
  conditions:
  - lastTransitionTime: 2019-04-02T08:51:55Z
    lastUpdateTime: 2019-04-02T08:51:55Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-04-02T08:51:55Z
    lastUpdateTime: 2019-04-02T08:51:57Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-04-02T08:51:55Z
    lastUpdateTime: 2019-04-02T08:53:12Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2019-04-02T08:51:57Z

@suleisl2000

Thanks a lot. We should really update our stuff.

As far as I can see, kubeflow 0.4.1 still deploys mxnet-operator/v1.
Do you recommend using v1beta1?

@stsukrov Yes, it is better to use v1beta1. You can upgrade to v1beta1, bypassing the ks command, with the following steps:

  1. Stop and delete your current mxnet-operator deployment and CRD (example commands below).
  2. Download the latest source code and install mxnet-operator as follows:
     kubectl create -f manifests/crd-v1beta1.yaml
     kubectl create -f manifests/rbac.yaml
     kubectl create -f manifests/deployment.yaml
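
For step 1, assuming the default resource names from the v1alpha1 install (the CRD name below is my guess at the default; adjust it and the namespace to match your cluster), something like this should do it:

  kubectl delete mxjobs --all
  kubectl delete deployment mxnet-operator
  kubectl delete crd mxjobs.kubeflow.org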

Please feel free to reach out to me with any issues.

@suleisl2000

Thanks. Currently we do a similar "git install" for mpi-job, since ksonnet is not part of our infra.
We were uncomfortable doing the same for mxnet-job, but it looks like that's the better option until the code is properly released.

What's the current state of ksonnet in kubeflow? "git install" is hardly a long-term solution, right?

@stsukrov Yes, I agree that "git install" is not a long-term solution. I fixed the kubeflow integration problem in PR#2909 and am waiting for feedback from the community on whether I can merge the fix into the v0.5 branch so that it is available in the v0.5.0 release; please refer to issue #2797 for details.
BTW, I think it would be a good feature if kubeflow could gracefully apply a hotfix to operators.

Thanks for the info!