Customizing environment variables of the pods in an MXJob crashes the operator
stsukrov opened this issue · 6 comments
apiVersion: kubeflow.org/v1alpha1
kind: MXJob
metadata:
  name: mxnet-gpu-dist-job
spec:
  jobMode: dist
  replicaSpecs:
  - mxReplicaType: SCHEDULER
    PsRootPort: 9000
    replicas: 1
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
          env:
          - name: PS_VERBOSE
            value: "2"
        restartPolicy: OnFailure
  - mxReplicaType: SERVER
    replicas: 2
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
          # env:
          # - name: PS_VERBOSE
          #   value: "2"
  - mxReplicaType: WORKER
    replicas: 4
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          args:
          - /incubator-mxnet/example/image-classification/train_imagenet.py
          - --num-epochs
          - '1'
          - --benchmark
          - '1'
          - --kv-store
          - dist_device_sync
          - --network
          - inception-v3
          - --batch-size
          - '64'
          - --image-shape
          - '3,299,299'
          - --gpus
          - '0'
          command:
          - python
          # env:
          # - name: PS_VERBOSE
          #   value: "2"
          name: mxnet
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
Enabling PS_VERBOSE on any of the pods crashes the operator:
f018986aae72:baictl stsukrov$ kubectl logs mxnet-operator-f46557c4f-wfklx
{"filename":"app/server.go:64","level":"info","msg":"KUBEFLOW_NAMESPACE not set, using default namespace","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:69","level":"info","msg":"[API Version: v1alpha1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.10.2 Go OS/Arch: linux/amd64]","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:153","level":"info","msg":"No controller_config_file provided; using empty config.","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:138","level":"info","msg":"Setting up event handlers","time":"2019-03-26T09:33:49Z"}
I0326 09:33:49.093823 1 leaderelection.go:174] attempting to acquire leader lease...
E0326 09:33:49.115391 1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mxnet-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/mxnet-operator", UID:"c7a02cee-4efb-11e9-a1a1-025004746b4c", ResourceVersion:"204330", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689114690, loc:(*time.Location)(0x18bc1a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"mxnet-operator-f46557c4f-wfklx\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-25T12:44:50Z\",\"renewTime\":\"2019-03-26T09:33:49Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'mxnet-operator-f46557c4f-wfklx became leader'
I0326 09:33:49.115722 1 leaderelection.go:184] successfully acquired lease default/mxnet-operator
{"filename":"controller/controller.go:176","level":"info","msg":"Starting MXJob controller","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:179","level":"info","msg":"Waiting for informer caches to sync","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:184","level":"info","msg":"Starting 1 workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:190","level":"info","msg":"Started workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:273","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Creating new job default/mxnet-gpu-dist-job","time":"2019-03-26T09:33:49Z"}
{"filename":"trainer/replicas.go:507","job":"default/mxnet-gpu-dist-job","job_type":"SCHEDULER","level":"info","msg":"Job mxnet-gpu-dist-job missing pod for replica SCHEDULER index 0, creating a new one.","mx_job_name":"mxnet-gpu-dist-job","runtime_id":"pub8","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:245","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Finished syncing job \"default/mxnet-gpu-dist-job\" (7.526234ms)","time":"2019-03-26T09:33:49Z"}
E0326 09:33:49.223718 1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/tusimple/go/src/runtime/asm_amd64.s:573
/home/tusimple/go/src/runtime/panic.go:502
/home/tusimple/go/src/runtime/panic.go:28
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/tusimple/go/src/runtime/asm_amd64.s:2361
panic: runtime error: index out of range [recovered]
panic: runtime error: index out of range
goroutine 102 [running]:
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0xfb2f60, 0x18aabd0)
/home/tusimple/go/src/runtime/panic.go:502 +0x229
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).CreatePodWithIndex(0xc42041f980, 0x0, 0x3f, 0xc4205ab378, 0x3)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218 +0x11cd
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).SyncPods(0xc42041f980, 0x0, 0x0)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509 +0x3a8
github.com/kubeflow/mxnet-operator/pkg/trainer.(*TrainingJob).Reconcile(0xc420692370, 0xc4205dc1d0, 0xc420711300, 0x1a, 0xc420358158)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362 +0x11f
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).syncMXJob(0xc4205dc1b0, 0xc420711300, 0x1a, 0xc420086c00, 0x0, 0x0)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291 +0xaee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.syncMXJob)-fm(0xc420711300, 0x1a, 0xc4205bd380, 0xf58b20, 0xc4203227c0)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162 +0x3e
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).processNextWorkItem(0xc4205dc1b0, 0xc4205ca100)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215 +0xee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).runWorker(0xc4205dc1b0)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201 +0x2b
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.runWorker)-fm()
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x2a
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42042e5b0)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42042e5b0, 0x3b9aca00, 0x0, 0x1, 0xc4205e6600)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42042e5b0, 0x3b9aca00, 0xc4205e6600)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).Run
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x22b
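The panic comes from CreatePodWithIndex at replicas.go:218. A guess at the failure mode, since the crash only triggers when env is set in the template: the operator may allocate the container's Env slice only when the user left it empty, and then write its own variables by fixed index, so any user-supplied entry makes the indexed writes run off the end. A minimal Go sketch of that hypothesis (the DMLC_* names, values, and slice length are illustrative, not the operator's actual code):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical reconstruction of the buggy pattern: Env is allocated only
// when the user left it empty, then written by fixed index. With one
// user-supplied entry (PS_VERBOSE), len(c.Env) == 1, the allocation is
// skipped, and the write to Env[1] panics with "index out of range".
func injectEnvIndexed(c *corev1.Container) {
	if len(c.Env) == 0 {
		c.Env = make([]corev1.EnvVar, 2)
	}
	c.Env[0] = corev1.EnvVar{Name: "DMLC_ROLE", Value: "worker"}
	c.Env[1] = corev1.EnvVar{Name: "DMLC_PS_ROOT_PORT", Value: "9091"}
}

// The robust pattern: append the operator's variables after whatever the
// user configured, so PS_VERBOSE and the DMLC_* settings coexist.
func injectEnvAppend(c *corev1.Container) {
	c.Env = append(c.Env,
		corev1.EnvVar{Name: "DMLC_ROLE", Value: "worker"},
		corev1.EnvVar{Name: "DMLC_PS_ROOT_PORT", Value: "9091"},
	)
}

func main() {
	c := corev1.Container{
		Name: "mxnet",
		Env:  []corev1.EnvVar{{Name: "PS_VERBOSE", Value: "2"}},
	}
	// injectEnvIndexed(&c) // panics: write to c.Env[1] when len(c.Env) == 1
	injectEnvAppend(&c)
	fmt.Println(c.Env) // PS_VERBOSE is preserved alongside the DMLC_* vars
}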
@stsukrov Hi Stanislav, I tested your case with the latest mxnet-operator and could not reproduce the issue. I suspect v1beta1 has already fixed the bug. Could you test your case with the v1beta1 operator? BTW, the v1alpha1 version is deprecated, and I didn't keep backward compatibility because we lack the resources to maintain it.
Below is the spec of the job I tested. You can see that I added env to all the pods, and the job completed successfully.
leisu@desk-73:/extend/Workspace/myproj/src/github.com/kubeflow/mxnet-operator/manifests$ kubectl get mxjob mxnet-job -o yaml
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  creationTimestamp: 2019-04-02T08:51:55Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "133831"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
  uid: 9125369d-5524-11e9-b812-704d7bb59f71
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - env:
            - name: PS_VERBOSE
              value: "2"
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - env:
            - name: PS_VERBOSE
              value: "2"
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "30"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            env:
            - name: PS_VERBOSE
              value: "2"
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2019-04-02T08:53:12Z
  conditions:
  - lastTransitionTime: 2019-04-02T08:51:55Z
    lastUpdateTime: 2019-04-02T08:51:55Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-04-02T08:51:55Z
    lastUpdateTime: 2019-04-02T08:51:57Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-04-02T08:51:55Z
    lastUpdateTime: 2019-04-02T08:53:12Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2019-04-02T08:51:57Z
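To double-check that a user-defined variable actually reaches the containers, you can print it from one of the pods. The pod name below is illustrative; the operator derives the real names from the job name and replica type:
kubectl exec mxnet-job-worker-0 -- printenv PS_VERBOSE  # pod name is an assumption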
Thanks a lot. We should really update our stack.
As far as I can see, kubeflow 0.4.1 still deploys mxnet-operator/v1.
Do you recommend using v1beta1?
@stsukrov Yes, it is better to use v1beta1. You can upgrade to v1beta1, bypassing the ks command, with the following steps:
- Stop and delete your current mxnet-operator deployment and CRD (delete commands are sketched after the install commands below)
- Download the latest source code and install mxnet-operator as follows:
kubectl create -f manifests/crd-v1beta1.yaml
kubectl create -f manifests/rbac.yaml
kubectl create -f manifests/deployment.yaml
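For the first step, the corresponding commands, assuming your deployment and CRD still carry the default names from the manifests:
kubectl delete deployment mxnet-operator  # deployment name assumed from manifests/deployment.yaml
kubectl delete crd mxjobs.kubeflow.org    # CRD name assumed from manifests/crd-v1beta1.yaml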
Please feel free to reach out to me with any issues.
Thanks. Currently we do a similar "git install" with mpi-job, since ksonnet was not part of our infra.
We were uncomfortable doing the same with mxnet-job, but it looks like the better option until the code is properly released.
What's the current state of ksonnet in kubeflow? A "git install" is hardly a long-term solution, right?
@stsukrov Yes, I agree that a "git install" is not a long-term solution. I fixed the kubeflow integration problem in PR#2909 and am waiting for feedback from the community on whether I can merge the fix into the v0.5 branch so that it can be available in the v0.5.0 release; please refer to issue #2797 for details.
BTW, I think it would be a good feature if kubeflow could support applying hotfixes to operators gracefully.
Thanks for the info!